Layered scene decomposition codec with higher order lighting

ABSTRACT

A system and methods are provided for a CODEC that drives a real-time light field display for multi-dimensional video streaming, interactive gaming and other light field display applications by applying a layered scene decomposition strategy. Multi-dimensional scene data, including information on directions of normals, is divided into a plurality of data layers of increasing depths as the distance between a given layer and the display surface increases. The data layers are sampled using a plenoptic sampling scheme and rendered using hybrid rendering, such as perspective and oblique rendering, to encode light fields corresponding to each data layer. The resulting compressed (layered) core representation of the multi-dimensional scene data is produced at predictable rates, then reconstructed and merged at the light field display in real-time by applying view synthesis protocols, including edge adaptive interpolation, to reconstruct pixel arrays in stages (e.g. columns then rows) from reference elemental images.

CLAIM OF PRIORITY

This application claims priority to U.S. patent application Ser. No. 16/798,230, filed on Feb. 21, 2020, now allowed, which claims priority to U.S. provisional patent application Ser. No. 62/809,390, filed on Feb. 22, 2019, the contents of both of which are incorporated herein by reference in their entirety.

FIELD OF THE INVENTION

The present disclosure relates to image (light field) data encoding and decoding, including data compression and decompression systems and methods for the provision of interactive multi-dimensional content at a light field display.

BACKGROUND OF THE INVENTION

Autostereoscopic, high-angular resolution, wide field of view (FOV), multi-view displays provide users with an improved visual experience. A three-dimensional display that can pass the 3D Turing Test (described by Banks et al.) will require a light field representation in place of the two-dimensional images projected by standard existing displays. A realistic light field representation requires enormous amounts of bandwidth to transmit the display data, which will comprise at least gigapixels of data. These bandwidth requirements currently exceed the bandwidth capabilities provided by technologies previously known in the art; the upcoming consumer video standard is 8K Ultra High-Def (UHD), which provides only about 33.2 megapixels of data per display.

Compressing data for transmission is previously known in the art. Data may be compressed for various types of transmission, such as, but not limited to: long-distance transmission of data over internet or ethernet networks; or transmission of a synthetic multiple-view created by a graphical processing unit (GPU) and transferred to a display device. Such data may be used for video streaming, real-time interactive gaming, or any other light field display.

Several encoder-decoders (CODECs) for compressed light field transmission are previously known in the art. Olsson et al. teach compression techniques where an entire light field data set is processed to reduce redundancy and produce a compressed representation. Subcomponents (i.e., elemental images) of the light field are treated as a video sequence to exploit redundancy using standard video coding techniques. Vetro et al. teach multiple-view specializations of compression standards that exploit redundancy between the light field subcomponents to achieve better compression rates, but at the expense of more intensive processing. These techniques may not achieve a sufficient compression ratio, and when a good ratio is achieved the encoding and decoding processes are too slow for real-time rates. These approaches assume that the entire light field exists in a storage disk or memory before being encoded. Therefore, large light field displays requiring large numbers of pixels introduce excessive latency when reading from a storage medium.

In an attempt to overcome hardware limitations for the delivery of multi-dimensional content in real-time, various methods and systems are known; however, these methods and systems present their own limitations.

U.S. Pat. No. 9,727,970 discloses a distributed, parallel (multi-processor) computing method and apparatus for generating a hologram by separating 3D image data into data groups, calculating from the data groups hologram values to be displayed at different positions on the holographic plane, and summing the values for each position to generate a holographic display. As a disclosure focused on generating a holographic display, the strategies applied involve manipulating data at a finer scale than a light field, and in this instance are characterized by the sorting and dividing of data according to colour, followed by colour image planes, and then further dividing the plane images into sub-images.

US Patent Publication No. 20170142427 describes content adaptive light field compression based on the collapsing of multiple elemental images (hogels) into a single hogel. The disclosure describes achieving a guaranteed compression rate; however, image lossiness varies, and in combining hogels as disclosed there is no guarantee of redundancy that can be exploited.

US Patent Publication No. 20160360177 describes methods for full parallax compressed light field synthesis utilizing depth information and relates to the application of view synthesis methods for creating a light field from a set of elemental images that form a subset of a total set of elemental images. The view synthesis techniques described therein do not describe or give methods to handle reconstruction artifacts caused during backwards warping.

US Patent Publication No. 20150201176 describes methods for full parallax compressed light field 3D imaging systems, disclosing the subsampling of elemental images in a light field based on the distance of the objects in the scene being captured. Though the methods describe the possibility of down-sampling the light field using simple conditions that could enhance the speed of encoding, in the worst case 3D scenes exist where no down-sampling would occur, and the encoding would fall back on transform encoding techniques which rely on the entire light field existing prior to encoding.

There remains a need for increased data transmission capabilities, improved data encoder-decoders (CODECs), and methods to achieve both improved data transmission and CODEC capabilities for the real-time delivery of multi-dimensional content to a light field display.

SUMMARY OF THE INVENTION

The present invention relates generally to 3D image data encoding and decoding for driving a light field display in real-time, which overcomes or can be implemented with present hardware limitations.

It is an object of the present disclosure to provide a CODEC with reduced system transmission latency and high bandwidth rates to provide for the production of a light field, in real time, with good resolution, at a light field display, for application in video streaming, and real-time interactive gaming. Light field or 3D scene data is deconstructed into subsets, which may be referred to as layers (corresponding to layered light fields), or data layers, sampled and rendered to compress the data for transmission and then decoded to construct and merge light fields corresponding to the data layers at a light field display.

According to an aspect there is a computer-implemented method comprising:

receiving a first data set comprising a three-dimensional description of a scene;

the first data set comprising information on directions of normals on surfaces in the scene,

the directions of the normals represented with respect to a reference direction, wherein

at least some of the surfaces have non-Lambertian reflection properties;

partitioning the first data set into a plurality of layers each representing a portion of the scene at a location with respect to a reference location; and

encoding multiple layers to generate a second data set, wherein a size of the second data set is smaller than a size of the first data set.

Embodiments can include one or more of the following features.

In an embodiment of the method, the second data set is transmitted to a remote device comprising a display for presenting the scene.

In an embodiment of the method, further comprising presenting the scene on the display.

In an embodiment of the method, encoding multiple layers comprises performing a sampling operation on at least a portion of the first data set to generate the second data set.

In an embodiment of the method, performing the sampling operation is based on a target compression rate associated with the second data set.

In an embodiment of the method, encoding multiple layers comprises:

rendering, using ray tracing, a set of pixels to be encoded;

selecting multiple elemental images from a plurality of elemental images such that the set of pixels are rendered using the selected multiple elemental images; and

sampling the set of pixels using a sampling operation.

In an embodiment of the method, performing the sampling operation comprises selecting multiple elemental images, from a portion of the plurality of elemental images, in accordance with a plenoptic sampling scheme.

In an embodiment of the method, further comprising:

performing the sampling operation comprises:

determining an effective spatial resolution associated with each layer; and

selecting multiple elemental images, from a portion of the plurality of elemental images, in accordance with the determined angular resolution.

In an embodiment of the method, the angular resolution is determined as a function of a directional resolution associated with the portion of the scene associated with each layer.

In an embodiment of the method, the three-dimensional description comprises light field data representing a plurality of elemental images.

In an embodiment of the method, the light field data includes a depth map corresponding to the elemental images.

In an embodiment of the method, further comprising:

receiving the second data set;

reconstructing the portions associated with a layer using the directions of normals on surfaces included in the scene for calculation of a specular component;

combining the reconstructed portions into a light field; and

presenting the light field image on a display device.

In an embodiment of the method, further comprising:

receiving user-input indicative of a location of a user with respect to the light field image; and

updating the light field image in accordance with the user-input prior to presentation on the display device.

In an embodiment of the method, the first data set includes extra-pixel information comprising information on the directions of normals stored in a geometry buffer.

In an embodiment of the method, the geometry buffer further stores color and depth information.

In an embodiment of the method, layers located closer to the display surface achieve a lower compression ratio than layers of the same width located further away from the display surface.

In an embodiment of the method, the multiple layers of the second data set comprise light fields.

In an embodiment of the method, further comprising the merging of the light fields to create a final light field.

In an embodiment of the method, the partitioning of the layers comprises restricting the depth range of each layer.

In an embodiment of the method, the layers located closer to the display surface are narrower in width than layers located farther away from the display surface.

In an embodiment of the method, the partitioning of the first data set into a plurality of layers maintains a uniform compression rate across the scene.

In an embodiment of the method, the partitioning of the first data set into a plurality of layers comprises partitioning the light field display into inner and outer frustum volume layer sets.

In an embodiment of the method, the method is used to generate a synthetic light field for multi-dimensional video streaming, multi-dimensional interactive gaming, real-time interactive content, or other light field display scenarios.

In an embodiment of the method, the synthetic light field is generated only in a valid viewing zone.

According to an aspect, there is a computer-implemented method comprising:

partitioning a three-dimensional surface description of a scene into layers, each layer having an associated light field and sampling scheme;

further partitioning at least one layer into a plurality of subsections, each subsection having an associated light field and sampling scheme, wherein a location of a particular subsection is determined in accordance with geometry of at least a portion of an object represented within the scene;

rendering a first set of pixels, comprising extra-pixel information, for each layer and each subsection in accordance with the sampling scheme, the first set of pixels corresponding to a sampled light field;

reconstructing the sampled light field for each layer and subsection using the first set of pixels; and

merging the reconstructed light fields into a single output light field image.

Embodiments can include one or more of the following features.

In an embodiment of the method, the first set of pixels and associated extra-pixel information is partitioned into subsets, whereby reconstructing sampled light fields for each layer and subsection and merging are performed using pixels from a single subset in a cache to create a subset of the output light field image.

In an embodiment of the method, reconstructing the sampled light field for each layer and subsection is performed by re-projecting pixels in the first set from the cache to create the subset of the output light field image.

In an embodiment of the method, re-projecting pixels is performed using a warping process along a single dimension in the first set of pixels followed by a second warping process in a second dimension in the first set of pixels.

In an embodiment of the method, the three-dimensional surface description comprises information on directions of normals on surfaces in the scene.

In an embodiment of the method, the directions of normals are represented with respect to a reference direction.

In an embodiment of the method, at least some of the surfaces have non-Lambertian reflection properties.

In an embodiment of the method, the first set of pixels when rendered comprises extra-pixel information comprising normals.

In an embodiment of the method, further comprising:

reconstructing the sampled light field for each layer and subsection using the first set of pixels, including normals on surfaces included in the scene for calculation of a specular component.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings.

FIG. 1: is a schematic representation (block diagram) of an embodiment of a layered scene decomposition (CODEC) system according to the present disclosure.

FIG. 2: is a schematic top-down view of the inner frustum volume and outer frustum volume of a light field display.

FIG. 3A: illustrates schematically the application of edge adaptive interpolation for pixel reconstruction according to the present disclosure.

FIG. 3B: illustrates a process flow for reconstructing a pixel array.

FIG. 4: illustrates schematically elemental images specified by a sampling scheme within a pixel matrix, as part of the image (pixel) reconstruction process according to the present disclosure.

FIG. 5: illustrates schematically a column-wise reconstruction of a pixel matrix, as part of the image (pixel) reconstruction process according to the present disclosure.

FIG. 6: illustrates a subsequent row-wise reconstruction of the pixel matrix, as part of the image (pixel) reconstruction process according to the present disclosure.

FIG. 7: illustrates schematically an exemplary CODEC system embodiment according to the present disclosure.

FIG. 8: illustrates schematically an exemplary layered scene decomposition of an image data set (a layering scheme of ten layers) correlating to the inner frustum light field of a display.

FIG. 9: illustrates schematically an exemplary layered scene decomposition of image data (two layering schemes of ten layers) correlating to the inner frustum and outer frustum light field regions, respectively, of a display.

FIG. 10: illustrates an exemplary CODEC process flow according to the present disclosure.

FIG. 11: illustrates an exemplary process flow for encoding 3D image (scene) data to produce layered and compressed core encoded (light field) representations, according to the present disclosure.

FIG. 12: illustrates an exemplary process flow for decoding core encoded representations to construct a (display) light field at a display, according to the present disclosure.

FIG. 13: illustrates an exemplary process flow for encoding and decoding residue image data for use with core image data to produce a (display/final) light field at a display according to the present disclosure.

FIG. 14: illustrates an exemplary CODEC process flow including specular light calculation, according to the present disclosure.

FIG. 15: illustrates an alternate exemplary CODEC process flow including specular light calculation, according to the present disclosure.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates generally to CODEC systems and methods for light field data or multi-dimensional scene data compression and decompression to provide for the efficient (rapid) transmission and reconstruction of a light field at a light field display.

Various features of the invention will become apparent from the following detailed description taken together with the illustrations in the Figures. The design factors, construction and use of the layered scene decomposition CODEC disclosed herein are described with reference to various examples representing embodiments which are not intended to limit the scope of the invention as described and claimed herein. The skilled technician in the field to which the invention pertains will appreciate that there may be other variations, examples and embodiments of the invention not disclosed herein that may be practiced according to the teachings of the present disclosure without departing from the scope and spirit of the invention.

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

The use of the word “a” or “an” when used herein in conjunction with the term “comprising” may mean “one,” but it is also consistent with the meaning of “one or more,” “at least one” and “one or more than one.”

As used herein, the terms “comprising,” “having,” “including” and “containing,” and grammatical variations thereof, are inclusive or open-ended and do not exclude additional, unrecited elements and/or method steps. The term “consisting essentially of” when used herein in connection with a composition, device, article, system, use or method, denotes that additional elements and/or method steps may be present, but that these additions do not materially affect the manner in which the recited composition, device, article, system, method or use functions. The term “consisting of” when used herein in connection with a composition, device, article, system, use or method, excludes the presence of additional elements and/or method steps. A composition, device, article, system, use or method described herein as comprising certain elements and/or steps may also, in certain embodiments, consist essentially of those elements and/or steps, and in other embodiments consist of those elements and/or steps, whether or not these embodiments are specifically referred to.

As used herein, the term “about” refers to an approximately +/−10% variation from a given value. It is to be understood that such a variation is always included in any given value provided herein, whether or not it is specifically referred to.

The recitation of ranges herein is intended to convey both the ranges and individual values falling within the ranges, to the same place value as the numerals used to denote the range, unless otherwise indicated herein.

The use of any examples or exemplary language, e.g. “such as”, “exemplary embodiment”, “illustrative embodiment” and “for example” is intended to illustrate or denote aspects, embodiments, variations, elements or features relating to the invention and not intended to limit the scope of the invention.

As used herein, the terms “connect” and “connected” refer to any direct or indirect physical association between elements or features of the present disclosure. Accordingly, these terms may be understood to denote elements or features that are partly or completely contained within one another, attached, coupled, disposed on, joined together, in communication with, operatively associated with, etc., even if there are other elements or features intervening between the elements or features described as being connected.

As used herein, the term “light field” at a fundamental level refers to a function describing the amount of light flowing in every direction through points in space, free of occlusions. Therefore, a light field represents radiance as a function of position and direction of light in free space. A light field can be synthetically generated through various rendering processes or may be captured from a light field camera or from an array of light field cameras.

A light field may be described most generally as a mapping between a set of points in 3D space with a corresponding set of directions onto a set or sets of energy values. In practice, these energy values would be red, green, blue color intensities, or potentially other radiation wavelengths.

As used herein, the term “light field display” is a device which reconstructs a light field from a finite number of light field radiance samples input to the device. The radiance samples represent the color components red, green and blue (RGB). For reconstruction in a light field display, a light field can also be understood as a mapping from a four-dimensional space to a single RGB color. The four dimensions include the vertical and horizontal dimensions (x,y) of the display and two dimensions describing the directional components (u,v) of the light field. A light field is defined as the function:

LF:(x,y,u,v)→(r,g,b)

For a fixed x_(f), y_(f), LF(x_(f),y_(f),u,v) represents a two dimensional (2D) image referred to as an “elemental image”. The elemental image is a directional image of the light field from the fixed x_(f), y_(f) position. When a plurality of elemental images are connected side by side, the resulting image is referred to as an “integral image”. The integral image can be understood as the entire light field required for the light field display.
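
As a minimal sketch of these definitions (not the disclosed implementation), a discretized light field can be modelled as a function from integer (x, y, u, v) indices to an RGB triple. The names `elemental_image`, `integral_image` and `toy_light_field` are hypothetical, and the toy field simply encodes its own coordinates so that the side-by-side tiling can be checked.

```python
from typing import Callable, List, Tuple

RGB = Tuple[int, int, int]
# LF: (x, y, u, v) -> (r, g, b), discretized to integer indices (an assumption).
LightField = Callable[[int, int, int, int], RGB]


def elemental_image(lf: LightField, xf: int, yf: int,
                    n_u: int, n_v: int) -> List[List[RGB]]:
    """Return the 2D directional image LF(xf, yf, u, v) for a fixed (xf, yf)."""
    return [[lf(xf, yf, u, v) for u in range(n_u)] for v in range(n_v)]


def integral_image(lf: LightField, m_x: int, m_y: int,
                   n_u: int, n_v: int) -> List[List[RGB]]:
    """Connect all elemental images side by side into one integral image."""
    rows: List[List[RGB]] = []
    for y in range(m_y):
        tiles = [elemental_image(lf, x, y, n_u, n_v) for x in range(m_x)]
        for v in range(n_v):
            row: List[RGB] = []
            for tile in tiles:
                row.extend(tile[v])
            rows.append(row)
    return rows


def toy_light_field(x: int, y: int, u: int, v: int) -> RGB:
    """Hypothetical light field whose colour encodes its own coordinates."""
    return (x, y, 16 * u + v)


if __name__ == "__main__":
    image = integral_image(toy_light_field, m_x=3, m_y=2, n_u=4, n_v=4)
    print(len(image), "rows x", len(image[0]), "columns")
```

In this layout, the sample for (x, y, u, v) sits at row y·N_v + v and column x·N_u + u of the integral image; other tilings are equally possible.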

As used herein, the term “description of a scene” refers to a geometric description of a three-dimensional scene that can be a potential source from which a light field image or video can be rendered. This geometric description may be represented by, but is not limited to, points, quadrilaterals, and polygons.

As used herein, the term “display surface” may refer to the set of points and directions as defined by a planar display plane and physical spacing of its individual light field hogel elements, as in a traditional 3D display. In the present disclosure, displays, as described herein, can be formed on curved surfaces; thus the set of points would then reside on the curved display surface, or any other desired display surface geometry that may be imagined. In the abstract mathematical sense, a light field may be defined and represented on any geometrical surface and may not necessarily correspond to a physical display surface with actual physical energy emission capabilities.

As used herein, the term “elemental image” represents a two dimensional (2D) image, LF(x_(f),y_(f),u,v), for a fixed x_(f), y_(f). The elemental image is a directional image of the light field from the fixed x_(f), y_(f) position.

As used herein, the term “integral image” refers to the image that results when a plurality of elemental images are connected side by side. The integral image can be understood as the entire light field required for the light field display.

As used herein, the term “layer” refers to any two parallel or non-parallel boundaries, with consistent or variable width, parallel or non-parallel to a display surface.

As used herein, the term “pixel” refers to a light source and light emission mechanism used to create a display.

It is contemplated that any embodiment of the compositions, devices, articles, methods and uses disclosed herein can be implemented by one skilled in the art, as is, or by making such variations or equivalents without departing from the scope and spirit of the invention.

Layered Scene Decomposition (LSD) CODEC System and Methods

The CODEC according to the present disclosure applies a strategy of drawing upon known sampling, rendering, and view synthesis methods for generating light field displays, adapting said methods for use in conjunction with a novel layered scene decomposition strategy as disclosed herein, including its derivation, implementation and applications.

3D Displays

A conventional display as previously known in the art consists of spatial pixels substantially evenly-spaced and organized in a two-dimensional array allowing for an idealized uniform sampling. By contrast, a three-dimensional display requires both spatial and angular samples. While the spatial sampling of a typical three-dimensional display remains uniform, the angular samples cannot necessarily be considered uniform in terms of the display's footprint in angular space. For a review of various light field parameterizations for angular ray distributions, please see U.S. Pat. No. 6,549,308.

The angular samples, also known as directional components of the light field, can be parameterized in various ways, such as the planar parameterizations taught by Gortler et al. in “The Lumigraph”. When the light field function is discretized in terms of position, the light field can be understood as a regularly-spaced array of planar-parameterized pinhole projectors, as taught by Chai et al. in “Plenoptic Sampling”. For a fixed x_(f), y_(f) the elemental image LF(x_(f),y_(f),u,v) represents a two-dimensional image which may be understood as an image projected by a pinhole projector with an arbitrary ray parameterization. For a light field display, the continuous elemental image is represented by a finite number of light field radiance samples. For an idealized, planar parameterized pinhole projector, said finite number of samples are mapped into the image plane as a regularly-spaced array (the regular spacing within the plane does not correspond to a regular spacing in the corresponding angular directional space).

In the case of a typical 3D light field display, the set of points and directions would be defined by the planar display plane and physical spacing of its individual light field hogel elements. However, it is known that displays can be formed on curved surfaces; thus the set of points would then reside on the curved display surface, or any other desired display surface geometry that may be imagined. In the abstract mathematical sense, a light field can be defined and represented on any geometrical surface and may not necessarily correspond to a physical display surface with actual physical energy emission capabilities. The concept of surface light field in the literature illustrates this case, as shown by Chen et al.

The consideration of planar parameterizations is not intended to limit the scope or spirit of the present disclosure, as the directional components of the light field can be parameterized by a variety of other arbitrary parameterizations. For example, lens distortions or other optical effects in a physically embodied pinhole projector can be modeled as distortions of the planar parameterization. In addition, display components may be defined through a warping function, such as taught by Clark et al. in “A transformation method for the reconstruction of functions from nonuniformly spaced samples.”

A warping function α(u,v) defines a distorted planar parameterization of the pinhole projector, producing arbitrary alternate angular distributions of directional rays in the light field. The angular distribution of rays propagating from a light field pinhole projector is determined by the pinhole projector's focal length f and a corresponding two dimensional warping function α(u,v).
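
To illustrate why regular spacing in the image plane does not give regular spacing in angle, the following sketch assumes an idealized pinhole projector in which a sample at plane coordinates (u, v) leaves along the direction (u', v', f), with (u', v') = α(u, v). This geometric convention, the function names and the identity warp are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Tuple

Warp = Callable[[float, float], Tuple[float, float]]


def identity_warp(u: float, v: float) -> Tuple[float, float]:
    """Undistorted planar parameterization (alpha is the identity)."""
    return (u, v)


def ray_angles(u: float, v: float, f: float,
               alpha: Warp = identity_warp) -> Tuple[float, float]:
    """Horizontal and vertical ray angles (radians) for plane sample (u, v)."""
    uw, vw = alpha(u, v)
    return (math.atan2(uw, f), math.atan2(vw, f))


def horizontal_angle_steps(n_u: int, spacing: float, f: float,
                           alpha: Warp = identity_warp) -> List[float]:
    """Angular step between neighbouring samples evenly spaced in u."""
    centre = (n_u - 1) / 2.0
    angles = [ray_angles((i - centre) * spacing, 0.0, f, alpha)[0]
              for i in range(n_u)]
    return [b - a for a, b in zip(angles, angles[1:])]


if __name__ == "__main__":
    # Evenly spaced plane samples give larger angular steps near the axis.
    for step in horizontal_angle_steps(n_u=7, spacing=1.0, f=2.0):
        print(f"{math.degrees(step):.2f} degrees")
```

Passing a different `alpha` changes the angular distribution without changing the sample grid, which is the role the warping function plays in the text above.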

An autostereoscopic light field display projecting a light field for oneor more users is defined as:

D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP))

where (M_(x),M_(y)) are the horizontal and vertical dimensions of the display's spatial resolution and (N_(u),N_(v)) are the horizontal and vertical dimensions of the display's angular resolution components. The display is an array of idealized light field projectors, with pitch D_(LP), focal length f, and a warping function α defining the distribution of ray directions for the light field projected by the display.

A light field LF(x,y,u,v) driving a light field display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) requires M_(x) light field radiance samples in the x direction, M_(y) light field radiance samples in the y direction, and N_(u) and N_(v) light field radiance samples in the u and v directions. While D is defined with a single warping function α, each of the light field planar-parameterized pinhole projectors within the array of idealized light field pinhole projectors may have a unique warping function α, if significant microlens variations exist in a practical pinhole projector causing the angular ray distribution to vary significantly from one microlens to another microlens.
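
The display tuple D can be sketched as a small record type. This is only an illustrative data structure: the class name, field names, units and example values are assumptions, and a single shared warping function is used rather than one per projector.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Warp = Callable[[float, float], Tuple[float, float]]


def identity_warp(u: float, v: float) -> Tuple[float, float]:
    """Hypothetical warping function: an undistorted planar parameterization."""
    return (u, v)


@dataclass(frozen=True)
class LightFieldDisplay:
    """D = (M_x, M_y, N_u, N_v, f, alpha, D_LP)."""
    m_x: int      # horizontal spatial resolution
    m_y: int      # vertical spatial resolution
    n_u: int      # horizontal angular resolution
    n_v: int      # vertical angular resolution
    f: float      # projector focal length (unit assumed, e.g. mm)
    alpha: Warp   # warping function shared by all projectors
    d_lp: float   # projector pitch (same assumed unit as f)

    def samples_per_elemental_image(self) -> int:
        return self.n_u * self.n_v

    def total_samples(self) -> int:
        """Radiance samples needed to drive the whole display."""
        return self.m_x * self.m_y * self.n_u * self.n_v


if __name__ == "__main__":
    # Example values are illustrative only.
    display = LightFieldDisplay(512, 512, 64, 64, 2.0, identity_warp, 1.0)
    print(display.total_samples(), "radiance samples per frame")
```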

Light Field Display Rendering

In “Fast computer graphics rendering for full parallax spatial displays,” Halle et al. provide a method for rendering objects located within an inner frustum volume and outer frustum volume of the display. FIG. 2 illustrates a light field display representing objects within a volumetric region defined by these two separate viewing frusta, with the inner frustum volume (110) located behind the display surface (300) (i.e., within the display) and the outer frustum volume (210) located in front of the display surface (i.e. outside of the display). As illustrated, various objects (shown schematically as prismatic and circular shapes) are located at varying depths from the display surface (300).

Halle et al. teach a double frustum rendering technique, where the inner frustum volume and outer frustum volume are separately rendered as two distinct light fields. The inner frustum volume light field LF_(O)(x,y,u,v) and outer frustum volume light field LF_(P)(x,y,u,v) are recombined into the single light field LF(x,y,u,v) through a depth merging process.

The technique uses a pinhole camera rendering model to generate the individual elemental images of the light field. Each elemental image (i.e. each rendered planar-parameterized pinhole projector image) requires the use of two cameras: one camera to capture the inner frustum volume and one camera to capture the outer frustum volume. Halle et al. teach rendering a pinhole projector image at a sampling region of the light field using a standard orthoscopic camera and its conjugate pseudoscopic camera. For a pinhole camera C, the corresponding conjugate camera is denoted as C*.

To capture an elemental image within a light field display with projectors parameterized using warping function α, a generalized pinhole camera based on a re-parameterization of an idealized planarly-parameterized pinhole camera is used. As taught by Gortler et al., a pinhole camera C with a focal length f has light rays defined by a parameterization created by two parallel planes. Pinhole camera C captures an image I_(C)(u,v), where (u,v) are coordinates in the ray parameterization plane. The generalized pinhole camera, C_(α), is based upon a planar parameterized camera warped using a two dimensional, continuous, invertible time-warping function, as taught by Clark et al. With a warping function α(u,v), the inverse is γ(u,v). Therefore, the image of C_(α) is I_(Cα)=I_(C)(α(u,v)).
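
The relation I_(Cα)=I_(C)(α(u,v)) can be sketched by treating an image as a function of continuous (u, v) coordinates. The particular warp, its inverse and the test image below are hypothetical choices that merely satisfy the stated requirements (continuous and invertible); they are not taken from the disclosure.

```python
from typing import Callable, Tuple

Image = Callable[[float, float], float]            # intensity at (u, v)
Warp = Callable[[float, float], Tuple[float, float]]


def warp_camera_image(base_image: Image, alpha: Warp) -> Image:
    """Return I_Calpha, where I_Calpha(u, v) = I_C(alpha(u, v))."""
    def warped(u: float, v: float) -> float:
        return base_image(*alpha(u, v))
    return warped


def alpha(u: float, v: float) -> Tuple[float, float]:
    """Hypothetical continuous, invertible warping function."""
    return (2.0 * u + 1.0, 0.5 * v)


def gamma(u: float, v: float) -> Tuple[float, float]:
    """Inverse of alpha: alpha(*gamma(u, v)) == (u, v)."""
    return ((u - 1.0) / 2.0, 2.0 * v)


def base_image(u: float, v: float) -> float:
    """Hypothetical image I_C of the planar-parameterized camera C."""
    return 3.0 * u + v


if __name__ == "__main__":
    warped = warp_camera_image(base_image, alpha)
    u, v = 4.0, 3.0
    # Sampling the warped image at gamma(u, v) recovers I_C(u, v).
    print(warped(*gamma(u, v)), base_image(u, v))
```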

Given the generalized pinhole camera, C_(α), a conjugate generalized camera C_(α)* is formed to complete double frustum rendering. The views generated from an M_(x)×M_(y) grid of generalized pinhole camera pairs are rendered to produce the light field for the light field display.

Therefore, the set of all generalized pinhole camera pairs that must be rendered to produce light field LF(x,y,u,v) for a given light field display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) is defined as:

{(C _(α) ,C _(α)*)(x,y)|1≤x≤M _(x),1≤y≤M _(y)}

A set of orthoscopic cameras O={C_(α)(x,y)|1≤x≤M_(x), 1≤y≤M_(y)} captures the light field image corresponding to the inner frustum volume and a set of conjugate generalized cameras P={C_(α)*(x,y)|1≤x≤M_(x), 1≤y≤M_(y)} captures the image corresponding to the outer frustum volume. As described above, the inner frustum volume and outer frustum volume are combined into a single light field.

Real-Time Rendering

It is believed that a usable light field display could require at least 1-10 billion pixels, each representing different directional rays of the light field. Considering a modest interactive frame rate of 30 Hz and assuming each raw ray pixel requires 24 bits, this results in a significant bandwidth requirement of (10 billion pixels)×(24 bits/pixel)×(30 frames/sec)=7,200 Gbits/sec (7.2 Tbits/sec), or 720 Gbits/sec at the 1 billion pixel end of that range. Since displays of even higher fidelity will ultimately be desired, this requirement can realistically be seen to scale into 100s of Tbit/s as this display technology reaches consumer markets and the technology continues to evolve in terms of visual fidelity.
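The raw bandwidth arithmetic can be checked with a short sketch. The function name and the default values (24 bits per ray pixel, 30 frames per second) are illustrative choices taken from the figures quoted above, not part of the disclosed system.

```python
def raw_bandwidth_bits_per_sec(pixels: int, bits_per_pixel: int = 24, fps: int = 30) -> int:
    """Uncompressed light field bandwidth: pixels x bits/pixel x frames/sec."""
    return pixels * bits_per_pixel * fps


if __name__ == "__main__":
    # Both ends of the 1-10 billion pixel range mentioned in the text.
    for pixels in (1_000_000_000, 10_000_000_000):
        gbits = raw_bandwidth_bits_per_sec(pixels) / 1e9
        print(f"{pixels:,} pixels -> {gbits:,.0f} Gbit/s")
```

At 24 bits and 30 Hz this gives 720 Gbit/s for 1 billion pixels and 7,200 Gbit/s for 10 billion pixels.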

Interactive computer graphics rendering is a computational process which, at least conventionally, requires computing a simulation of a virtual camera imaging a scene. The scene is typically described as collections of light sources and surfaces or volumes with various material, color and physical optical properties and various viewing camera positions. This rendering computation must be performed rapidly enough to produce an interactive frame rate (e.g. at least 30 Hz). The rendering fidelity can be adjusted based on the degree to which the light transport calculations are approximated; more approximation decreases the computational requirements. It is for this reason that interactive computer graphics generally have a lower visual fidelity than offline rendered graphics, where very high-fidelity light transport models are employed.

The requirement for interactivity implies a certain frame rate (typically at least 20-30 Hz, but often desired to be higher) with a corresponding bandwidth, and also implies reduced latency to support immediate graphical response to user input. The high bandwidth combined with the latency requirements imposes certain challenges in terms of computation.

In conventional 2D computer graphics, meeting the challenges of low-latency, high frame-rate graphics has led to the widespread use of specialized hardware designed to accelerate interactive rendering calculations, known as graphics processing units (GPUs). These specialized architectures can produce much higher visual fidelity at interactive rates than the general-purpose central processing units (CPUs) used in modern computers. While impressive in their performance capabilities, these architectures are ultimately optimized for their specific task, namely rendering single camera images of scenes at high frame rates while maximizing visual quality.

For light field displays, the rendering problem becomes that of rendering an image produced by a virtual light field camera. A light field camera (defined elsewhere in more detail) may be viewed as an array of many conventional 2D camera views. This more general camera model results in calculations with a substantially different geometric structure. The result is that the calculations do not map well into the framework of existing accelerated computer graphics hardware.

In the conventional case, a rendering calculation pipeline procedure is defined. This has traditionally been based on rasterization, but ray-tracing pipelines have been standardized as well (e.g. recently DirectX ray tracing). In either case, the computational hardware architecture is tailored to the form of these pipelines and their associated required calculations, with the ultimate goal being the production of 2D images at video frame rates.

What is required is a different pipeline for interactive light field rendering calculation that may be realized with a minimal hardware footprint in order to minimize costs, size, weight and power requirements for the rendering architecture. These requirements are driven by the desire to eventually create a consumer product at the corresponding price range.

When considering a pipeline for light field rendering and the large data rates required, one main bottleneck involves the potential memory bandwidth that would be required. In conventional 2D video rendering and processing, buffering entire frames (or even subsequences of frames) in double data rate (DDR) memory is a commonly performed operation. The data rates of DDR and its cost vs. capacity make it quite suitable in these types of applications. However, the light field bandwidth requirements as discussed previously suggest that significant DDR buffering may be quite constraining in terms of physical footprint and cost.

The first step in rendering is generally loading the scene description representation from storage in DDR memory. One notable aspect of light field rendering is that each light field camera rendering pass can be viewed as an array of more or less conventional 2D camera rendering passes. Naively, each of these 2D cameras (hogels) must be rendered twice, once for the inner and once for the outer hogel, in the method of “double frustum rendering” as suggested by Halle. The number of rays is then 2 for every direction represented by the display. An alternate scheme that is evident from the existing art is to define an inner and outer far clip plane and have rays cast from the outer far clip plane, through hogels on the display surface, and end at the inner far clip plane (or vice versa). This results in one ray per pixel.

In the worst case, each of these 2D camera rendering passes in the array requires loading the entire scene description. Even in more optimistic cases, naively repeating the load of the scene description from DDR can result in a large bandwidth requirement, especially compared to conventional 2D rendering where the scene is generally accessed at most a small number of times per frame.

It is worth considering whether such a situation can be addressed through use of a caching strategy, given that redundant memory access appears to be present. In high-performance computing, caching data in a smaller but faster storage (typically located directly on the die of the chip where computations are being performed) can significantly alleviate DDR bandwidth constraints when the computation is structured in such a way that data is redundantly loaded in some kind of coherent, predictable pattern. In terms of light field rendering, each hogel's elemental image rendering requires the same scene description in the worst case, thus identifying a significant potential redundancy to be exploited by effective caching. Modern ray-tracing techniques for surface rendering of a single camera view of a scene are able to exploit cache coherency by the principle that coherent rays on the image plane often intersect the same geometry (or at least do so for the primary intersection point) (reference Wald, etc.). Rasterization of polygons onto a single imaging camera inherently exploits this same coherency, as a single polygon can be cached in hardware once loaded and all of the pixels it intersects can be calculated in a hardware-accelerated rasterization process. In the context of light fields, it is evident that these same coherency principles can be exploited within the individual 2D camera views that make up a light field, if they are rendered using ray tracing or rasterization.

It can also be observed that imaging rays from different hogels in a light field camera intersect the same polygons, presenting an additional element of coherency. It is proposed that this coherency can be exploited in a structured form in order to produce a higher-performance, interactive rendering system for light field displays. It is shown how this coherency can be exploited by buffering a structured intermediate, partially rendered form of the output light field. This buffer is hereinafter referred to as the light field surface buffer, or surface buffer. The pixels in this buffer can also contain extra-pixel information, for example color, depth, surface coordinates, normals, material values, transparency values, and other possible scene information.

It is proposed that the surface buffer can be rendered efficiently using what is effectively a conventional 2D ray-tracing rendering pipeline. The surface buffer is based on the concept of a layered scene decomposition and a sampling scheme as presented herein, which specifies which pixels are to comprise the surface buffer. Based on analysis presented within this specification, it may be shown that with an appropriately chosen layered scene decomposition and sampling scheme, the resulting surface buffer contains fewer pixels than the desired rendered output light field image frame, as it can be viewed as a form of data compression scheme.

Furthermore, an appropriately chosen layered scene decomposition and sampling scheme will result in a surface buffer that will contain samples of all the surface areas in the scene visible from any of the hogels in the targeted light field camera view. Such a surface buffer will contain data to enable reconstruction of the light fields associated with each layer and layer subsection. Once reconstructed, these light fields can be merged into a single light field image, representing the desired rendering output, as described elsewhere in this document.

It is further proposed that the resulting surface buffer can be partitioned into smaller subsets. This partitioning can occur in such a way that each subset of the surface buffer data can be used by itself to reconstruct some portion of the resulting output light field. One practical embodiment involves partitioning layers and subsections whose size is based on the ΔEI function, then choosing a sampling scheme that includes a small number (e.g. 4) of elemental images per partition, which are then used to reconstruct the un-sampled elemental images within the partition. If this partitioning is chosen appropriately, subsets of the surface buffer can be loaded into a faster cache memory, from which reconstruction and merging calculations can be performed without resorting to repeated loads from slower system memories.
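As a minimal sketch of such a partitioning, the following assumes a rectangular hogel grid, a single integer gap between sampled elemental images, and four sampled corner images per partition. The function name `partition_hogels`, the 0-based indexing and the clamping at the grid edge are illustrative assumptions, not details given in this disclosure.

```python
from typing import Iterator, List, Tuple

Hogel = Tuple[int, int]  # (x, y) index of an elemental image, 0-based


def partition_hogels(mx: int, my: int, gap: int) -> Iterator[Tuple[List[Hogel], List[Hogel]]]:
    """Split an mx-by-my hogel grid into tiles whose corners are `gap` apart.

    Yields (sampled_corners, unsampled_hogels) for each tile. The four corners
    are the elemental images kept by the sampling scheme; the remaining hogels
    of the tile would be reconstructed from them. Hogels on an edge shared by
    two tiles are listed in both tiles.
    """
    if gap < 1:
        raise ValueError("gap must be >= 1")
    for y0 in range(0, my - 1, gap):
        for x0 in range(0, mx - 1, gap):
            # Clamp the far corners so the last tile stays inside the grid.
            x1 = min(x0 + gap, mx - 1)
            y1 = min(y0 + gap, my - 1)
            corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
            unsampled = [(x, y)
                         for y in range(y0, y1 + 1)
                         for x in range(x0, x1 + 1)
                         if (x, y) not in corners]
            yield corners, unsampled


if __name__ == "__main__":
    for corners, unsampled in partition_hogels(9, 5, 4):
        print(corners, len(unsampled))
```

Each yielded tile is a candidate unit for loading into cache: its four sampled images are the only inputs needed to reconstruct the hogels listed as unsampled.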

Thus to summarize, an efficient method to render light field video at interactive rates can be described as starting with a 3D description of a scene, rendering a surface buffer, then rendering the final output frame by reconstructing layers and subsections from cached individual partitions of the surface buffer in order to create corresponding portions of the desired output light field image. When rendering is structured in this form, as opposed to applying a brute force methodology which performs light field rendering as a number of conventional 2D rendering passes, less slow memory bandwidth is required as cache memories are able to be exploited in a structured fashion by partitioning of the surface buffer.

Data Compression for Light Field Display

Piao et al. utilize a priori physical properties of a light field in order to identify redundancies in the data. The redundancies are used to discard elemental images based on the observation that elemental images representing neighboring points in space contain significant overlapped information. This avoids performing computationally complex data transforms in order to identify information to discard. Such methods do not utilize depth map information associated with each elemental image.

In “Compression for Full-Parallax Light Field Displays,” Graziosi et al. propose criteria to sub-sample elemental images based on simple pinhole camera coverage geometry to reduce light field redundancy. The downsampling technique taught by Graziosi et al. is simpler than the complicated basis decompositions often employed in other CODEC schemes for two-dimensional image and video data. Where an object is located deep within a scene, the light field is sampled at a smaller rate. For example, when two separate pinhole cameras provide two different fields of view, there is very little difference from one elemental image to the next elemental image, and the fields of view from the two pinhole cameras overlap. While the views are subsampled based on geometric (triangle) overlap, the pixels within the views are not compressed. Because these pixels can be substantial, Graziosi et al. compress the pixels with standard two-dimensional image compression techniques.

Graziosi et al. teach that the sampling gap (ΔEI) between elemental images, based on the minimum depth of an object d, can be calculated as follows, where θ represents the light field display's field of view and P represents the lens pitch of the integral imaging display:

${\Delta \; {EI}} = \frac{\left( {2d} \right)\tan \; \left( {\theta/2} \right)}{P}$

This strategy provides a theoretically lossless compression for fronto-parallel planar surfaces when there are no image occlusions. As shown in the formula, the sampling gap increases with d, providing an improved compression rate when fewer elemental images are required. For sufficiently small d, ΔEI can reach 0. Therefore, this downsampling technique gives no guaranteed compression rate. In a scene with multiple small objects, where the objects are close to the screen or are at the screen distance, each elemental image would have at least some pixels at a 0 depth and this technique would provide no gains, i.e. ΔEI=0 throughout the integral image.
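The dependence of the sampling gap on depth can be illustrated with a short sketch of the ΔEI formula above. The function name and the example values (a 90 degree field of view and a unit lens pitch) are hypothetical; d and P are assumed to be expressed in the same length unit.

```python
import math


def sampling_gap(d: float, fov_deg: float, lens_pitch: float) -> float:
    """Sampling gap between elemental images: (2 d tan(theta / 2)) / P."""
    return 2.0 * d * math.tan(math.radians(fov_deg) / 2.0) / lens_pitch


if __name__ == "__main__":
    # The gap grows linearly with depth and is zero at the display surface.
    for depth in (0.0, 1.0, 4.0, 16.0):
        print(depth, sampling_gap(depth, fov_deg=90.0, lens_pitch=1.0))
```

Because the gap is proportional to d, any object at or near the display surface drives the gap toward zero, which is the reason no compression rate can be guaranteed for a single undivided light field.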

Graziosi et al. equate the rendering process with the initial encoding process. Instead of producing all of the elemental images, this method only produces the number needed to reconstruct the light field while minimizing any loss of information. Depth maps are included with the elemental images selected for encoding and the missing elemental images are reconstructed using well-established warping techniques associated with depth image-based rendering (DIBR). The selected elemental images are further compressed using methods similar to the H.264/AVC method, and the images are decompressed prior to the final DIBR-based decoding phase. While this method provides improved compression rates with reasonable signal distortion levels, no time-based performance results are presented. Such encoding and decoding cannot provide good low-latency performance for high bandwidth rates. In addition, this method is limited to use for a single object that is far away from the display surface; in scenes with multiple overlapping objects and many objects close to the display surface, the compression would be forced back to use H.264/AVC style encoding.

Chai teaches plenoptic sampling theory to determine the amount of angular bandwidth required to represent fronto-parallel planar objects at a particular scene depth. Zwicker et al. teach that the depth of field of a display is based on the angular resolution, with more resolution providing a greater depth of field. Therefore, objects close to the display surface are represented adequately with lower angular resolution, while far objects require larger angular resolutions. Zwicker et al. teach that the maximum display depth of field with ideal projective lenses based on planar parameterization is:

$Z_{DOF} = \frac{fP_{l}}{P_{P}}$

where P_(l) is the lens pitch, P_(p) is the pixel pitch and f is the focal length of the lenses. In a three-dimensional display with an isotropic directional resolution (i.e. N=N_(u)=N_(v)), N=P_(l)/P_(p). Therefore, Z_(DOF)=fN.

To determine the angular resolution required to represent the full spatial resolution of the display, at a given depth d, the equation is rearranged as:

${N_{res}(d)} = \frac{d}{f}$

Therefore, each focal length distance into the scene adds another pixel of angular resolution required to fully represent objects at the given spatial resolution of the display surface.
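A brief numerical sketch of these two relations follows. The function names and the example display parameters (focal length 2.0, lens pitch 4.0, pixel pitch 0.125, all in the same arbitrary length unit) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def max_depth_of_field(f: float, lens_pitch: float, pixel_pitch: float) -> float:
    """Z_DOF = f * P_l / P_p, which equals f * N for an isotropic display."""
    return f * lens_pitch / pixel_pitch


def required_angular_resolution(d: float, f: float) -> float:
    """N_res(d) = d / f: one more directional sample per focal length of depth."""
    return d / f


if __name__ == "__main__":
    f, lens_pitch, pixel_pitch = 2.0, 4.0, 0.125
    n = lens_pitch / pixel_pitch                      # directional resolution N
    z_dof = max_depth_of_field(f, lens_pitch, pixel_pitch)
    print("N =", n, "Z_DOF =", z_dof)
    # At d = Z_DOF the required resolution reaches the display's full N.
    print("N_res(Z_DOF) =", required_angular_resolution(z_dof, f))
```

With these example values N is 32 and Z_DOF is 64, and the required angular resolution at Z_DOF equals N, consistent with Z_(DOF)=fN.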

Layered Scene Decomposition and Sampling Scheme

The sampling gap taught by Graziosi et al. and the plenoptic sampling theory taught by Zwicker et al. provide complementary light field sampling strategies: Graziosi et al. increase downsampling for distant objects (ΔEI) while Zwicker et al. increase downsampling for near objects (N_(res)). However, when downsampling a single light field representing a scene, the combination of these strategies does not guarantee compression. Therefore, the present disclosure divides a multiple-dimensional scene into a plurality of layers. This division into a plurality of (data) layers is referred to herein as a layered scene decomposition. Where K₁ and K₂ are natural numbers, we define L=(K₁,K₂,L^(O),L^(P)), partitioning the inner and outer frustum volumes of a three-dimensional display. The inner frustum is partitioned into a set of K₁ layers, where L^(O)={l₁^(O), l₂^(O), . . . , l_(K₁)^(O)}. Each inner frustum layer is defined by a pair of boundaries parallel to the display surface at distances d_(min)(l_(i)^(O)) and d_(max)(l_(i)^(O)) for 1≤i≤K₁ from the display surface. The outer frustum is partitioned into a set of K₂ layers, where L^(P)={l₁^(P), l₂^(P), . . . , l_(K₂)^(P)}. Each outer frustum layer is defined by a pair of boundaries parallel to the display surface at distances d_(min)(l_(i)^(P)) and d_(max)(l_(i)^(P)) for 1≤i≤K₂ from the display surface. In alternate embodiments, the inner and outer frustum volumes may be divided by layering schemes differing from each other, and the pair of boundaries may or may not be parallel to the display surface.
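One possible in-memory representation of such a decomposition is sketched below. It assumes, purely for illustration, that the layers of one frustum are contiguous, so that a single ascending list of boundary depths defines them; the names `Layer`, `make_layers` and `layer_for_depth` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    frustum: str   # "inner" for L^O, "outer" for L^P
    index: int     # 1-based layer index i
    d_min: float   # near boundary distance from the display surface
    d_max: float   # far boundary distance from the display surface


def make_layers(frustum: str, boundaries: List[float]) -> List[Layer]:
    """Build contiguous layers from ascending boundary depths (an assumption)."""
    if any(lo >= hi for lo, hi in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be strictly ascending")
    return [Layer(frustum, i + 1, lo, hi)
            for i, (lo, hi) in enumerate(zip(boundaries, boundaries[1:]))]


def layer_for_depth(layers: List[Layer], d: float) -> Optional[Layer]:
    """Return the first layer whose closed depth range contains d, if any."""
    for layer in layers:
        if layer.d_min <= d <= layer.d_max:
            return layer
    return None


if __name__ == "__main__":
    inner = make_layers("inner", [0.0, 1.0, 3.0, 7.0])   # K1 = 3
    outer = make_layers("outer", [0.0, 2.0, 6.0])        # K2 = 2
    print(len(inner), len(outer), layer_for_depth(inner, 2.0))
```

A depth lying exactly on a shared boundary falls in two closed ranges; this sketch resolves the tie in favor of the nearer layer, which is a design choice rather than a requirement of the decomposition.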

Each of the layered scene decomposition layers has an associated light field (herein also referred to as a “light field layer”) based on the scene restrictions to the planar bounding regions of the layer. Consider a layered scene decomposition L=(K₁,K₂,L^(O),L^(P)) for a light field display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with an inner frustum layer l_(i)^(O)∈L^(O) for 1≤i≤K₁, or an outer frustum layer l_(j)^(P)∈L^(P) for 1≤j≤K₂. The inner frustum light field LF_(l_(i)^(O)) is generated from the set of generalized pinhole cameras O={C_(α)(x,y)|1≤x≤M_(x), 1≤y≤M_(y)}. The cameras are constrained such that only the space at distance d from the light field display surface, where d_(min)(l_(i)^(O))≤d≤d_(max)(l_(i)^(O)), is imaged. Therefore, for an inner frustum layer with a fixed x, y and C_(α)(x,y)∈O, we calculate LF_(l_(i)^(O))(x,y,u,v)=I_(C_(α)(x,y))(u,v). Similarly, the outer frustum light field LF_(l_(j)^(P)) is generated from the set of generalized pinhole cameras P={C_(α)*(x,y)|1≤x≤M_(x), 1≤y≤M_(y)}. The cameras are constrained such that only the space at distance d from the light field display surface, where d_(min)(l_(j)^(P))≤d≤d_(max)(l_(j)^(P)), is imaged. Therefore, for an outer frustum layer with a fixed x, y and C_(α)*(x,y)∈P, we calculate LF_(l_(j)^(P))(x,y,u,v)=I_(C_(α)*(x,y))(u,v).

The sets of light fields for the inner and outer frustum regions relative to the layered scene decomposition L can be further defined. Assume a light field display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with a layered scene decomposition L=(K₁,K₂,L^(O),L^(P)). The set of inner frustum region light fields is defined as O^(LF)={LF_(l_(i)^(O)) | 1≤i≤K₁}. The set of outer frustum region light fields is defined as P^(LF)={LF_(l_(i)^(P)) | 1≤i≤K₂}.

As defined, a layered scene decomposition generates a light field for each layer. For any layered scene decomposition, orthoscopic cameras generate inner frustum volume light fields and pseudoscopic cameras generate outer frustum volume light fields. If a scene captured by these generalized pinhole camera pairs is comprised of only opaque surfaces, each point of the light field has an associated depth value which indicates the distance from the generalized pinhole camera plane to the corresponding point in space imaged. When given a light field LF_(l_(i)^(O))∈O^(LF) or LF_(l_(i)^(P))∈P^(LF), the LF_(l_(i)^(O)) depth map is formally defined as D_(m)[LF_(l_(i)^(O))](x,y,u,v), and the LF_(l_(i)^(P)) depth map is formally defined as D_(m)[LF_(l_(i)^(P))](x,y,u,v). The depth maps D_(m)=∞ where there are no surface intersection points corresponding to the associated imaging generalized pinhole camera rays. Elsewhere across their domains, d_(min)(l_(i)^(P))≤D_(m)[LF_(l_(i)^(P))](x,y,u,v)≤d_(max)(l_(i)^(P)) and d_(min)(l_(i)^(O))≤D_(m)[LF_(l_(i)^(O))](x,y,u,v)≤d_(max)(l_(i)^(O)). In other words, depth maps associated with a layered scene decomposition layer's light field are bounded by the depth bounds of the layer itself.

A merging operation can re-combine the layered scene decomposition layer sets back into the inner and outer frustum volumes, or LF_(O) and LF_(P). The inner and outer frustum volume light fields are merged with the merging operator *_(m). For example, when given two arbitrary light fields, LF₁(x,y,u,v) and LF₂(x,y,u,v), where i=argmin_(j∈{1,2})D_(m)[LF_(j)](x,y,u,v), *_(m) is defined as:

LF ₁(x,y,u,v)*_(m) LF ₂(x,y,u,v)=LF _(i)(x,y,u,v)

Therefore, LF_(O)(x,y,u,v) and LF_(P)(x,y,u,v) can be recovered from the sets O^(LF) and P^(LF) by merging the light fields associated with the inner and outer frustum layers. For example:

LF_(O) = LF_(l₁^(O)) * mLF_(l₂^(O)) * m  … * mLF_(l_(K₁)^(O))LF_(P) = LF_(l₁^(P)) * mLF_(l₂^(P)) * m  … * mLF_(l_(K₁)^(P))

The present disclosure provides a layered scene decomposition operation and an inverse operation which merges the data to reverse said decomposition. Performing a layered scene decomposition with K layers is understood to create K times as many individual light fields. The value of the layered scene decomposition is in the light fields induced by the layers; these light field layers are more suitable for downsampling than the original total light field or the inner frustum volume or outer frustum volume light fields, as the total data size required for multiple downsampled layered scene decomposition light field layers with an appropriate sampling scheme is significantly less than the size of the original light field.
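The depth-based merging operation that reverses the decomposition can be sketched as follows. This is a minimal illustration under stated assumptions: each light field is flattened into a list of (color, depth) samples indexed identically over (x,y,u,v), an empty ray carries infinite depth, and ties are resolved in favor of the first operand. The names `merge` and `merge_layers` are hypothetical.

```python
import math
from functools import reduce
from typing import List, Tuple

Color = Tuple[int, int, int]
Sample = Tuple[Color, float]        # (color, depth); depth is inf if no surface was hit
LightField = List[Sample]           # flattened over (x, y, u, v)

EMPTY: Sample = ((0, 0, 0), math.inf)


def merge(lf1: LightField, lf2: LightField) -> LightField:
    """Per-ray merge: keep the sample whose depth map value is smaller."""
    if len(lf1) != len(lf2):
        raise ValueError("light fields must share the same sampling")
    return [a if a[1] <= b[1] else b for a, b in zip(lf1, lf2)]


def merge_layers(layers: List[LightField]) -> LightField:
    """Fold the merge over all layer light fields of one frustum."""
    return reduce(merge, layers)


if __name__ == "__main__":
    near = [((255, 0, 0), 1.0), EMPTY]
    far = [((0, 255, 0), 2.0), ((0, 0, 255), 3.0)]
    print(merge_layers([far, near]))
```

Because taking the minimum depth per ray is associative and commutative apart from ties, the layers can be folded in any order, which is why a simple reduction suffices in this sketch.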

The skilled technician in the field to which the invention pertains will appreciate that there are multiple types of sampling schemes that can successfully sample a light field. The sampling scheme S provided is not intended to limit or depart from the scope and spirit of the invention, as other sampling schemes, such as specifying individual sampling rates for each elemental image in the layered scene decomposition layer light fields, can be employed. Relatively simple sampling schemes can provide an effective CODEC with greater sampling control; therefore, the present disclosure provides a simple sampling scheme to illustrate the disclosure without limiting or departing from the scope and spirit of the invention.

A light field sampling scheme provided according to the present disclosure represents a light field encoding method. Given a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) and a layered scene decomposition L=(K₁,K₂,L^(o),L^(P)), the present disclosure provides a sampling scheme S associated with L as an M_(x)×M_(y) binary matrix M_(S)[l_(i)] associated with any layer l_(i) in L^(o) or L^(P) and a mapping function R(l_(i)) to map each layer l_(i) to a pair R(l_(i))=(n_(x),n_(y)). A binary ({0,1}) entry in M_(S)[l_(i)] at (x_(m),y_(m)) indicates if the elemental image LF_(l_(i))(x_(m),y_(m),u,v) is included in the sampling scheme: a (1) indicates LF_(l_(i))(x_(m),y_(m),u,v) is included, and a (0) indicates LF_(l_(i))(x_(m),y_(m),u,v) is not included. R(l_(i))=(n_(x),n_(y)) indicates the elemental images in light field LF_(l_(i)) are sampled at a resolution of n_(x)×n_(y).

The present disclosure also provides a layered scene decomposition light field encoding process that draws upon plenoptic sampling theory. The following description pertains to the inner frustum volume L^(o) of a layered scene decomposition L, but the outer frustum volume L^(P) may be encoded in a similar fashion.

For each l_(i)∈L^(o), the depth map of the corresponding light field LF_(l_(i)) is restricted to d in the range d_(min)(l_(i)^(o))≤d≤d_(max)(l_(i)^(o)). Based on the sampling scheme presented above, the present disclosure creates a sampling scheme S using the following equation to guide the creation of M_(S)[l_(i)^(o)]:

${\Delta \; {{EI}\left( {d_{\min}\left( l_{i}^{o} \right)} \right)}} = \frac{\left( {2{d_{\min}\left( l_{i}^{o} \right)}} \right)\tan \; \left( {\theta/2} \right)}{D_{LP}}$

In other words, ΔEI guides the distance between “1” entries in the M_(S) matrix associated with each layered scene decomposition layer. The following equation sets the resolution of the individual elemental images in a layer:

${{R\left( l_{i}^{o} \right)} = \frac{d_{\max(l_{i}^{o})}}{f}},\frac{d_{\max {(l_{i}^{o})}}}{f}$

${N_{res}\left( {d_{\max}\left( l_{i}^{o} \right)} \right)} = \frac{d_{\max}\left( l_{i}^{o} \right)}{f}$

This sampling scheme, using both ΔEI and N_(res) to drive individual layered scene decomposition layer sampling rates, can be considered as a layered plenoptic sampling theory sampling scheme (otherwise referred to herein as “plenoptic sampling scheme”). This plenoptic sampling scheme is based on a display utilizing the plenoptic sampling theory identity function α(t)=t. This per-layer sampling scheme provides lossless compression for fronto-parallel planar scene objects where the objects within a layer do not occlude each other.
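A per-layer sampling scheme of this kind can be sketched as follows. The sketch assumes that the gap ΔEI(d_min) is floored to a whole number of hogels (with a minimum of 1), that N_res(d_max) is rounded up and capped at the display's directional resolution, and that sampled elemental images start at index 0; none of these rounding choices is specified in the text, and the function name `build_layer_sampling` is hypothetical.

```python
import math
from typing import List, Tuple


def build_layer_sampling(mx: int, my: int, d_min: float, d_max: float,
                         f: float, fov_deg: float, d_lp: float,
                         n_max: int) -> Tuple[List[List[int]], Tuple[int, int]]:
    """Return (M_S, R) for one layer: a binary matrix and an image resolution."""
    # Gap between "1" entries: 2 * d_min * tan(theta / 2) / D_LP.
    delta_ei = 2.0 * d_min * math.tan(math.radians(fov_deg) / 2.0) / d_lp
    gap = max(1, math.floor(delta_ei + 1e-9))   # epsilon guards float round-off
    # Elemental image resolution: d_max / f per direction, capped at n_max.
    n_res = min(n_max, max(1, math.ceil(d_max / f - 1e-9)))
    # m_s[y][x] is 1 where the elemental image at (x, y) is sampled.
    m_s = [[1 if (x % gap == 0 and y % gap == 0) else 0 for x in range(mx)]
           for y in range(my)]
    return m_s, (n_res, n_res)


if __name__ == "__main__":
    m_s, r = build_layer_sampling(mx=9, my=9, d_min=4.0, d_max=6.0,
                                  f=2.0, fov_deg=90.0, d_lp=1.0, n_max=16)
    for row in m_s:
        print("".join(str(v) for v in row))
    print("R =", r)
```

A layer that starts at the display surface (d_min of 0) degenerates to a gap of 1, so every elemental image is sampled, while deeper layers keep progressively fewer images at progressively higher resolution.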

The assumption of only fronto-parallel planar scene objects is restrictive and does not represent typical scenes; inevitably there are intra-layer occlusions, especially for layered scene decomposition layers that are larger in size. To capture and encode a full range of potential scenes without introducing significant perceivable artifacts, the system can draw upon information in addition to the light field plenoptic sampling scheme of the present disclosure.

For example, surfaces are locally approximated by planar surfaces at various slanting angles. In “On the bandwidth of the plenoptic function,” Do et al. theorize time-warping techniques allowing for the spectral characterization of slanted light field display surfaces. This work suggests that a necessary decrease in downsampling and the need for precise characterization of local bandwidth changes are induced by the degree of surface slanting, the depth of objects in the scene, and the positioning of objects at the FOV edge. Therefore, if signal distortions from fronto-parallel geometry deviations are perceptually significant, residue representations can adaptively send additional or supplemental elemental image data (dynamically altering the static sampling scheme) to compensate for losses incurred.

The present disclosure therefore provides for the identification of information as “core” or “residue” for the encoding and decoding of the light field by the CODEC. When given a light field display D and a corresponding layered scene decomposition L with an associated sampling scheme S, the present disclosure considers the encoded, downsampled light fields associated with L and S, as well as the number of layered scene decomposition layers and the depth of said layers, as the “core” representation of a light field encoded and decoded by the CODEC. Any additional information transmitted along with the core (encoded) representation of the light field that may be required during the decoding process is considered as the “residue” representation of the light field to be processed by the CODEC and used together with the core representation of the light field to produce the final light field displayed.

Many layered scene decompositions and sampling schemes that may be defined in the framework defined above can still exhibit issues with holes due to occlusions after they are merged and the original light field is reconstructed. Occlusions between objects in separate layers do not lead to holes after reconstruction. However, objects that occlude each other and are both located within the same layer can lead to holes, especially for certain sampling schemes.

To be specific, if sampling within a particular layer is such that the gap between sampled elemental images is large, then there is a high likelihood that occluded objects can be under-represented, thus resulting in holes. One solution to this is to simply sample elemental images at a higher rate. A higher sampling rate however results in a lower compression rate. Adding more elemental images therefore can result in the inclusion of significant redundant information. What is needed is a more discriminant method that can include additional information that helps to fill in holes while not contributing to redundancy in the overall representation. For example, consider a layered scene decomposition:

L=(K₁,K₂,L^(o),L^(P))

For each layer l_(i) in L^(o) or L^(P), we can define a set of residue layers:

R(l _(i))={r(l _(i))(j)|1≤j≤K _(i)}

Where K_(i) is a natural number describing the number of residue layers required for layer l_(i). For each residue layer, like layered scene decomposition layers, there is a light field associated with the layer:

LF_(r(l_(i))(j))

In the most general description, these additional layers can be free form with no further restrictions. In practice, additional information that can help to deal with occlusions is represented in these residue layers. One way to implement this is to have residue layers use the same sampling scheme as their parent layered scene decomposition layer; however, one possible variation might be to sample the residue layers with a lower directional resolution in order to tightly control the compression rate of the layered scene decomposition (LSD) plus residue layer combination.

Specifically, the residue layers may be defined as additional layers corresponding to the concept of Deep G-Buffers. Thus:

D_(m)[LF_(l_(i))](x,y,u,v) < D_(m)[LF_(r(l_(i))(1))](x,y,u,v) < . . . < D_(m)[LF_(r(l_(i))(K_(i)))](x,y,u,v)

In this case, residue layers sit in contrast to layered scene decomposition layers in the sense that the depth ranges of each layer are not fixed by pre-decided depth divisions of the layered scene decomposition layer scheme but are based on the depth layer characteristics inherent to the geometry in the scene being represented.
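The depth ordering of a layer and its residue layers can be illustrated for a single ray with a simple depth-peeling sketch. This is only one way to realize such an ordering and is not presented as the method of the disclosure or as the Deep G-Buffer algorithm itself; the function name `peel_ray`, the use of infinity for missing surfaces and the removal of duplicate depths are assumptions.

```python
import math
from typing import List, Tuple


def peel_ray(hit_depths: List[float], d_min: float, d_max: float,
             k_residue: int) -> Tuple[float, List[float]]:
    """Split one ray's surface hits into a primary depth and residue depths.

    The nearest hit inside [d_min, d_max] belongs to the layer's light field;
    the next k_residue hits, in increasing depth, belong to residue layers
    1..k_residue. Missing entries are reported as infinity.
    """
    inside = sorted({d for d in hit_depths if d_min <= d <= d_max})
    # Pad with infinity when the ray hits fewer than 1 + k_residue surfaces.
    padded = inside + [math.inf] * (1 + k_residue - len(inside))
    return padded[0], padded[1:1 + k_residue]


if __name__ == "__main__":
    # Hits at 2.0, 3.5 and 5.0 fall inside the layer; 9.0 lies beyond it.
    print(peel_ray([5.0, 2.0, 3.5, 9.0], d_min=1.0, d_max=6.0, k_residue=2))
```

The residue depths depend only on the geometry the ray meets, not on fixed depth divisions, which mirrors the contrast with layered scene decomposition layers drawn above. The strict ordering holds among the finite entries.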

Layer-Based Compression Analysis

Predictable compression rates, together with downsampling criteria (which do not indicate achievable compression rates), are required to create a real-time rendering and transmission system. The following provides a compression analysis of the present disclosure's layered scene decomposition encoding strategy.

As already described, downsampling a light field based on plenoptic sampling theory alone does not offer guaranteed compression rates. The present disclosure provides a downsampling light field encoding strategy, allowing for a low-latency, real-time light field CODEC. In one embodiment, complementary sampling schemes based on plenoptic sampling theory, using both ΔEI and N_(res), are employed to drive individual layered scene decomposition layer sampling rates. The layered scene decomposition, representing the total 3D scene as a plurality of light fields, expands the scene representation by a factor of the number of layers. The present disclosure further contemplates that when layer depths are chosen appropriately, compression rates can be guaranteed when combined with plenoptic sampling theory based downsampling.

For a light field LF_(l_(i)) corresponding to a given layered scene decomposition layer l_(i), the layer's restricted depth range provides a guaranteed compression rate for the layer's light field. The achievable compression ratio from downsampling a scene completely contained within a single layer can be explained in the following theorem:

Theorem 1

Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with an isotropic directional resolution N=N_(u)=N_(v), a layered scene decomposition L and an associated sampling scheme S=(M_(S),R). Assume a layered scene decomposition layer l_(i) with the corresponding light field LF_(l_(i)) such that d_(min)(l_(i))<Z_(DOF)(D), and M_(S)[LF_(l_(i))] is selected so the distance between “1” entries is set to ΔEI(d_(min)(l_(i))) and R(l_(i))=N_(res)(d_(max)(l_(i))). The compression ratio associated with S relative to the layered scene decomposition layer l_(i) is 1:

${N^{2}\left( \frac{d_{\min}\left( l_{i} \right)}{d_{\max}\left( l_{i} \right)} \right)^{2}}.$

Proof 1

Consider a layered scene decomposition layer within the maximum depth of field of the display, where

${d_{\min}\left( l_{i} \right)} = {{\frac{Z_{DOF}}{c}\mspace{14mu} {and}\mspace{14mu} {d_{\max}\left( l_{i} \right)}} = \frac{Z_{DOF}}{d}}$

for 0<c, d≤Z_(DOF). Therefore,

$c = {{\frac{Z_{DOF}}{d_{\min}\left( l_{i} \right)}\mspace{14mu} {and}\mspace{14mu} d} = {{\frac{Z_{DOF}}{d_{\max}\left( l_{i} \right)}\mspace{14mu} {and}\mspace{14mu} {d/c}} = {\frac{d_{\min}\left( l_{i} \right)}{d_{\max}\left( l_{i} \right)}.}}}$

Therefore ΔEI(d_(min)(l_(i)))=N/c and N_(res)(d_(max)(l_(i)))=N/d.

Based on this rate of sub-sampling, the system requires every (N/c)^(th) elemental image, therefore providing a compression ratio of 1:(N/c)². The elemental image sub-sampling provides a 1:d² compression ratio. Therefore, the total compression ratio is 1:(N/c)²*1:d²=1:N²(d/c)². The compression factor term

$c_{f} = \frac{d_{\min}\left( l_{i} \right)}{d_{\max}\left( l_{i} \right)}$

determines the compression ratio.
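By way of illustration only, the per-layer figures in this analysis can be computed directly. The following Python sketch assumes the per-layer ratio takes the form 1:N²c_f², which matches the per-layer size S_layer(i) used in the proof of Theorem 2; the function names are hypothetical and are not part of the disclosure.

```python
def layer_compression_factor(d_min: float, d_max: float) -> float:
    """Compression factor c_f = d_min / d_max for one layer."""
    if not 0 < d_min <= d_max:
        raise ValueError("expected 0 < d_min <= d_max")
    return d_min / d_max


def layer_compression_denominator(n: int, d_min: float, d_max: float) -> float:
    """Return X in the per-layer ratio 1:X, assuming X = N^2 * c_f^2."""
    return (n * layer_compression_factor(d_min, d_max)) ** 2


if __name__ == "__main__":
    # Directional resolution N = 32, layer spanning depths 2 to 4 (arbitrary units).
    print(layer_compression_denominator(32, 2.0, 4.0))
```

With N=32 and a layer spanning depths 2 to 4, c_f is 0.5 and the layer is stored at 1/256 of its full size under this assumption.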

There may be an alternate case where d_(min)(l_(i))=Z_(DOF) and d_(max)(l_(i)) can extend to any arbitrary depth. We know ΔEI(Z_(DOF))=N and N_(res) attains the maximum possible value of N for all depths d≥Z_(DOF). Based on this rate of sub-sampling, the system requires every N^(th) elemental image, thus providing the light field with a 1:N² compression ratio. Adding additional layered scene decomposition layers beyond Z_(DOF) adds redundant representational capability when representing fronto-parallel planar objects. Therefore, when creating a core encoded representation, the total scene may be optimally decomposed with the maximum depth of field in the layers.

Given the compression calculation expression for downsampling a layered scene decomposition layer, we can determine how the compression factor varies as the layer parameters vary. For a layer of a fixed width, or d_(max)(l_(i))−d_(min)(l_(i))=w for some w, the c_(f) term is minimized when the layer is located closest to the display surface. Therefore, layered scene decomposition layers located closer to the display surface require a narrower width to achieve the same compression ratio as layers located further away from the display surface. This compression rate analysis can extend to scenes that are partitioned into multiple adjacent fronto-planar layers located in the space from the display surface until the depth Z_(DOF).

Theorem 2

Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with an isotropic directional resolution N=N_(u)=N_(v), a layered scene decomposition L and an associated sampling scheme S=(M_(S),R). Let S_(LF)=M_(x)M_(y)N_(u)N_(v), denoting the number of image pixels in the light field. The compression ratio of the layered scene decomposition representation can be defined as:

$\frac{A}{S_{LF}} = {{\left( {1/N^{2}} \right){\sum_{i = 1}^{K}\left( {1/{c_{f}(i)}^{2}} \right)}} = {\left( {1/N^{2}} \right){\sum_{i = 1}^{K}\left( \frac{d_{\max}\left( l_{i} \right)}{d_{\min}\left( l_{i} \right)} \right)^{2}}}}$

Proof 2

For a given layered scene decomposition layer downsampled with compression ratio 1:N²c_(f)(i)², the size of the compressed layer is:

${S_{layer}(i)} = {\left( \frac{1}{N^{2}{c_{f}(i)}^{2}} \right)S_{LF}}$

To calculate the compression ratio, the size of each layer in the compressed form is computed and summed, and the total compressed layer size is divided by the size of the light field. Consider a sum where the size of the compressed set of layers is:

$A = {\sum_{i = 1}^{K}{\left( \frac{1}{N^{2}{c_{f}(i)}^{2}} \right)S_{LF}}}$

Therefore, for layers of constant width ΔL, with the i^(th) layer spanning depths f+(i−1)ΔL to f+iΔL, the compression ratio of the combined layers is:

$\frac{A}{S_{LF}} = {{\left( {1/N^{2}} \right){\sum\limits_{i = 1}^{K}\left( {1/{c_{f}(i)}^{2}} \right)}} = {\left( {1/N^{2}} \right){\sum\limits_{i = 1}^{K}\left( \frac{f + {i\Delta L}}{f + {\left( {i - 1} \right)\Delta L}} \right)^{2}}}}$

In a system where the layered scene decomposition layers are of variable width, with d_(min)(i) and d_(max)(i) representing the front and back boundary depths of the i^(th) layer, the compression ratio of the layered scene decomposition representation is:

$\frac{A}{S_{LF}} = {{\left( {1/N^{2}} \right){\sum\limits_{i = 1}^{K}\left( {1/{c_{f}(i)}^{2}} \right)}} = {\left( {1/N^{2}} \right){\sum\limits_{i = 1}^{K}\left( \frac{d_{\max}(i)}{d_{\min}(i)} \right)^{2}}}}$

For layered scene decomposition layers of constant width, the terms 1/c_(f)(i)² of the sum Σ_(i=1)^(K)(1/c_(f)(i)²) are monotonically decreasing in i and tend towards 1.

Therefore, layered scene decomposition layers located closer to the display surface achieve a lower compression ratio than layers of the same width located further away from the display surface. To maximize efficiency, layered scene decomposition layers with a narrower width are located closer to the display surface, and wider layered scene decomposition layers are located further away from the display surface; this placement maintains a uniform compression rate across the scene.
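The variable-width expression above lends itself to a short numerical check. The following Python sketch evaluates A/S_LF = (1/N²)Σ(d_max(i)/d_min(i))² for a list of layer boundaries; the function name and the example depths are illustrative assumptions only.

```python
from typing import Iterable, Tuple


def total_compression_ratio(n: int, layers: Iterable[Tuple[float, float]]) -> float:
    """Evaluate A / S_LF = (1 / N^2) * sum((d_max / d_min)^2) over all layers.

    Each layer is a (d_min, d_max) pair of front and back boundary depths.
    """
    total = 0.0
    for d_min, d_max in layers:
        if not 0 < d_min <= d_max:
            raise ValueError("expected 0 < d_min <= d_max for every layer")
        total += (d_max / d_min) ** 2
    return total / n ** 2


if __name__ == "__main__":
    # Constant-width layers: the per-layer terms shrink with depth (4, 2.25, 1.78...).
    constant = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
    # Widths that double with depth: every term equals 4.
    doubling = [(1.0, 2.0), (2.0, 4.0)]
    print(total_compression_ratio(4, constant))
    print(total_compression_ratio(4, doubling))
```

For N=4 and the two doubling layers, each term is 4 and the ratio evaluates to 8/16 = 0.5, which illustrates how widening layers with depth keeps the per-layer contribution uniform.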

Number and Size of Layered Scene Decomposition Layers

To determine the number of layers and the size of layers required for the layered scene decomposition, a light field display with an α(t)=t identity function is provided as an example. The consideration of this identity function is not intended to limit the scope or spirit of the present disclosure, as other functions can be utilized. The skilled technician in the field to which the invention pertains will appreciate that while the display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) is defined with a single identity function α, each light field planar-parameterized pinhole projector within an array of planar-parameterized pinhole projectors may have a unique identity function α.

To losslessly represent fronto-planar surfaces (assuming no occlusions), a single layered scene decomposition layer with a front boundary located at depth Z_(DOF) represents the system from Z_(DOF) to infinity. Lossless compression may be defined as a class of data compression algorithms that allows the original data to be perfectly reconstructed from the compressed data. To generate a core representation, layered scene decomposition layers beyond the deepest layer located at the light field display's maximum depth of field are not considered, as these layers do not provide additional representative power from the core representation perspective; this applies to both the inner and outer frustum volume layer sets.

Within the region from the display surface to the maximum depth of field of the display (for both the inner and outer frustum volume layer sets), the layered scene decomposition layers utilize maximum and minimum distance depths that are integer multiples of the light field display f value. Layered scene decomposition layers with a narrower width provide better per-layer compression ratios, thereby providing better overall scene compression ratios. However, a greater number of layers in the decomposition increases the amount of processing required for decoding, as a greater number of layers must be reconstructed and merged. The present disclosure accordingly teaches a layer distribution scheme with differential layer depths. In one embodiment, layered scene decomposition layers (and by correlation the light fields represented by said layers) with a narrower width are located closer to the display surface, and the layer width (i.e., the depth difference between the front and back layer boundaries) increases exponentially as the distance from the display surface increases.
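One possible layer distribution consistent with this description can be sketched as follows. The Python below assumes, purely for illustration, that the first layer begins at depth f and that each layer's width doubles, so that boundaries fall at f, 2f, 4f and so on up to Z_DOF; the doubling rule, the starting depth and the function name are assumptions rather than requirements of the disclosure.

```python
from typing import List, Tuple


def exponential_layers(f: float, z_dof: float) -> List[Tuple[float, float]]:
    """Return (d_min, d_max) pairs whose widths double with depth.

    Boundaries are integer multiples of f (f, 2f, 4f, ...). The final back
    boundary is clamped to z_dof, so it is only an integer multiple of f
    when z_dof itself is one.
    """
    if f <= 0 or z_dof <= f:
        raise ValueError("expected 0 < f < z_dof")
    layers = []
    multiple = 1
    while multiple * f < z_dof:
        d_min = multiple * f
        d_max = min(2 * multiple * f, z_dof)
        layers.append((d_min, d_max))
        multiple *= 2
    return layers


if __name__ == "__main__":
    for d_min, d_max in exponential_layers(1.0, 8.0):
        print(d_min, d_max, d_max - d_min)
```

With f=1 and Z_DOF=8 this yields three layers of widths 1, 2 and 4. Every layer then has d_max/d_min = 2, so each contributes equally to the compression ratio, while the layer count grows only logarithmically with Z_DOF/f.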

A three-dimensional description of a scene is partitioned into a plurality of subsets and multiple layers or subsets are encoded to generate a second data set, smaller than a size of the first data set. The encoding of the layer or subset may include performing a sampling operation on the subset. An effective resolution function is used to determine a suitable sampling rate. Elemental images associated with a subsection are then downsampled using the determined suitable sampling rate.

Related works have focused on analyzing depth of field and how resolution degrades at depth in displays with multiple light-attenuating layers. These works still analyze depth of field as an observer-independent concept, similarly to Zwicker et al., who describe the depth of field of a light field display as the range of depths in which a virtual plane oriented parallel to the display surface can be reproduced at the display's maximum spatial resolution. This framework, however, does not factor in the observer; it is based on effectively orthographic views of the scene. The information a particular viewer has access to from the light field from a given viewpoint, in terms of the quality of objects at depth in a scene, is addressed.

Alpaslan et al. performed a study to determine how perceived resolution changed at distance in a light field display in terms of varying optical and spatio-angular resolution parameters. It was shown that increased angular resolution reduces the decrease in perceived resolution at depth in the display. This analysis is based on patterns of oscillations measured in cycles per space, where space is that in the inner and outer frustum of the display, as seen in the equation below.

p=p _(o) +s*tan ϕ

where p is the smallest feature size, p_(o) is the pixel size, s is the depth into the screen and ϕ is the angular distance between two samples. This formula is based on simple geometric arguments and the assumption that directional rays of the display are distributed uniformly in angular space. It clearly shows that feature size increases with distance; however, the formula is formulated independently of a particular observer and of what feature size the observer may resolve at depth.
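For illustration, the feature size relation p = p_o + s·tan ϕ can be evaluated numerically. The Python sketch below is a direct transcription of that relation; the function name, units and sample values are assumptions chosen only to show the trend.

```python
import math


def smallest_feature_size(p_o: float, s: float, phi: float) -> float:
    """Evaluate p = p_o + s * tan(phi).

    p_o: pixel size, s: depth into the screen (same length unit as p_o),
    phi: angular distance between two samples, in radians.
    """
    return p_o + s * math.tan(phi)


if __name__ == "__main__":
    # Feature size grows linearly with depth for a fixed angular spacing.
    for depth in (0.0, 10.0, 20.0):
        print(depth, smallest_feature_size(0.5, depth, math.radians(1.0)))
```

At zero depth the feature size equals the pixel size, and a smaller ϕ (higher angular resolution) reduces the rate at which the feature size grows with depth.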

In contrast, Dodgson analyzes how observers can occupy various viewing zones in front of a 3D display, which correspond to the density of projection from the display's angular components, but does not directly relate these viewing zones to the apparent viewing quality of objects at depth.

One approach to dealing with small depth of field, or DoF, involves scaling content to fit within the target region. This technique does appear to produce a good result, but since it involves some optimization on the content, it appears it would not be immediately amenable to real-time datasets in an interactive setting. A simpler, fixed scheme re-scaling technique could work in a real-time setting yet would likely introduce unacceptable distortion artifacts such as cardboarding. Cardboarding may be defined as a pervasive artifact that occurs when visualizing 3D content, where objects appear flat due to depth compression.

In cases where all surfaces are Lambertian, it can be assumed that an observer is a pinhole camera. A real human eye is more accurately modeled as a finite aperture camera, which is an approach taken in other 3D display view simulation work. However, a pinhole camera is used for simplicity, as it can serve as an upper bound on quality over the finite aperture case in some sense. It is assumed that the canonical image forms the basis of the observer image. Thus, to examine the quality of the image, the canonical rays are considered. More specifically, it is assumed that the observer image I_(O) can be related to the canonical image I_(c)[D,O] through a type of upscaling operation. The canonical image is a sampled version of a warped observer image. Applying the inverse warping function to a continuous version of the canonical image would then give the observer image. The warping function can also be described as a projection function.

It is possible to present a formal definition of a 3D light field display in terms of various design parameters. Without loss of generality, it is assumed that the display is centered in 3D space at (x,y,z)=(0,0,0) and points at an observer in the positive z direction, with the y direction pointing up. The formal definition is as follows:

Consider a light field display D=(M_(x),M_(y),N_(u),N_(v),f,D_(LP)), where (M_(x), M_(y)) are the horizontal and vertical dimensions of the display's spatial resolution and (N_(u), N_(v)) are the horizontal and vertical dimensions of the display's angular resolution components. It is assumed the display is an array of idealized light field projectors whose pitch is D_(LP), and with a focal length f. Suppose the M_(x)×M_(y) array of light field projectors can be indexed LFP_(ij) such that the first coordinate aligns with the x-axis and the second with the y-axis. This results in the set of light field projectors {LFP_(ij)|1≤i≤M_(x), 1≤j≤M_(y)}. For any particular light field projector, one may address each of the individual N_(u)×N_(v) pixels by LFP_(ij)(u,v) for 1≤u≤N_(u) and 1≤v≤N_(v). Based on the focal length f of the display, one can compute an angular field of view, denoted as θ_(FOV).

It is known that a light field display can represent objects that are within a volumetric region defined by two separate viewing frusta, encompassing an area both behind and in front of the display surface. These two frusta are hereinafter referred to as the inner and outer frustum regions of a given display.

An observer O=(X_(O), D_(O), f_(O)) is defined as a pinhole camera imaging a display, with focal length f_(O), with focal point located at X_(O) and pointing in direction D_(O), where D_(O) is a 3D vector. The image formed by observer O is known as the observer image and denoted as I_(O).

A particular observer, depending on its particular location and direction/orientation, images a different subset of the possible output ray directions projected by the light field projectors of a display. These rays may be defined more precisely:

Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,D_(LP)) and an observer O=(X_(O), D_(O), f_(O)). A set of rays is defined, one for each light field projector associated with D, each given by the line that connects X_(O) and the center of that light field projector. Let X_(ij)^(c) denote the center of LFP_(ij). One then defines the set of lines

$\left\{ \overset{\_}{X_{O}X_{ij}^{c}} \mid 1 \leq i \leq M_{x},\ 1 \leq j \leq M_{y} \right\}$

as the set of canonical rays of observer O relative to display D. The canonical rays, it should be noted, only form a subset of the rays that would contribute to the observer image. It is straightforward to observe that the set of canonical rays of an observer is independent of the observer's direction D_(O) and focal length f_(O). Thus, the set of all possible observers at a particular location share the same set of canonical rays relative to a display.

For any canonical ray

$\overset{\_}{X_{O}X_{ij}^{c}}$

there is an angular pair (θ_(ij),φ_(ij)) that represents the spherical coordinates of the vector associated with

$\overset{\_}{X_{O}X_{ij}^{c}}$

Each of the N_(u)×N_(v) elements of LFP_(ij)(u,v) also has a spherical coordinate representation, which may be written as (θ(u),φ(v)), as well as a spatial vector representation, denoted as

$\overset{\_}{P_{u,v}}.$

The canonical rays for a given display and observer may be seen to sample intensity values from the light field projected by the display and its light field projectors. These intensity values may be observed to form an M_(x) by M_(y) image, hereinafter referred to as the canonical image relative to display D and observer O. This image is denoted as I_(c)[D,O](x,y).

Considering a display's field of view θ_(FOV), there is a minimum distance that an observer must be located at in order to be able to see light from all the light field projector-based pixels on the display. Generally, for a smaller FOV, this distance becomes larger and for a large field of view the observer may come closer. This distance can be determined by trigonometry as:

$d_{O} = \frac{M_{x}D_{LP}}{2{\tan \left( \frac{\theta_{FOV}}{2} \right)}}$
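As a brief illustration, the minimum observer distance can be computed from the display parameters. The Python sketch below transcribes d_O = M_x·D_LP / (2·tan(θ_FOV/2)); the function name and the example values are hypothetical.

```python
import math


def min_observer_distance(m_x: int, d_lp: float, theta_fov: float) -> float:
    """Evaluate d_O = (M_x * D_LP) / (2 * tan(theta_FOV / 2)).

    m_x: number of light field projectors across the display width,
    d_lp: projector pitch, theta_fov: field of view in radians.
    """
    if not 0.0 < theta_fov < math.pi:
        raise ValueError("theta_fov must lie strictly between 0 and pi radians")
    return (m_x * d_lp) / (2.0 * math.tan(theta_fov / 2.0))


if __name__ == "__main__":
    # A narrower field of view pushes the minimum viewing distance outward.
    for fov_deg in (90.0, 60.0, 30.0):
        print(fov_deg, min_observer_distance(100, 1.0, math.radians(fov_deg)))
```

For a 90 degree field of view the distance is half the display width M_x·D_LP, and it grows as the field of view narrows, in agreement with the description above.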

Each light field projector represents a segment of a continuous smooth light field. Given a display and an observer, each of the canonical rays samples the light field using the intensities within its corresponding light field projector array. That is, given a canonical ray

$\overset{\_}{X_{O}X_{ij}^{c}},$

an intensity value is reconstructed based on a resampling operation performed on the intensity values contained in the light field projector image, LFP_(ij). A common model employed here gives each ray implied by the light field projector image a spot width. This spot width allows one to describe a physical reconstruction of a light field, by giving each projector intensity value a physical angular spread.

To simplify the analysis, the point spread function (PSF) model is ignored to some extent. Instead, it is assumed that the canonical rays sample the light field from a particular LFP_(ij) using a nearest neighbor interpolation scheme. Consider a ray from a canonical image corresponding to the intensity value I_(c)[D,O](i,j), for some (i,j). Suppose the ray vector X_(O)X_(ij)^(c) can be represented using spherical coordinates as (θ_(ij),φ_(ij)). Let

$\left( {u_{n}^{ij},v_{n}^{ij}} \right) = {{\arg \min}_{u,v}{\arccos \left( {\overset{\_}{X_{O}X_{ij}^{c}} \cdot \overset{\_}{P_{u,v}}} \right)}}$

The indices (u_(n)^(ij),v_(n)^(ij)) represent the light field projector pixel which has minimum angular distance from the sampling canonical ray. Thus, by this nearest neighbor interpolation:

I _(c)[D,O](i,j)=LFP _(ij)(u _(n) ^(ij) ,v _(n) ^(ij))

This reconstruction model allows for an initially simpler analysis and understanding of the sampling geometry.

Based on the concept of a display's depth of field (DoF), the ability of a 3D display to represent spatial resolution decreases as objects move beyond the maximum depth of field. Current light field or multi-view displays appear to suffer from a small depth of field, due to relatively poor angular resolution and sampling density. It is obvious from viewing these displays that objects at any significant depth into the screen become quite blurry.

Objects in a 2D display, while lacking the additional perceptual cues of a 3D display, do not become unnaturally blurred at distance. In a standard 2D display, 3D objects that appear deep in a scene and are virtually distant from the display surface degrade in a natural way relative to the maximum resolution of the display. That is, as an object becomes more distant from the 2D display, its projected area on the 2D display is less, thus the number of pixels representing this projected area decreases with the size of the area. This corresponds to how more distant objects are projected to a smaller area on the retina (or an imaging plane of a camera) and thus less detail can be resolved. However, in a 3D display with relatively low angular resolution, objects at distance become blurry and are not represented at a resolution proportional to their projected area on the display surface.

It is proposed to measure the effective spatial resolution of a 3D display at depth in terms of how it compares to a pseudo-equivalent 2D display. A 3D display is used to mimic a 2D display by considering how the 3D display presents a fronto-parallel plane located in the display's inner frustum region. The plane's size increases with depth in order to fill the entire width of the viewing frustum relative to a given observer position. Let d_(p) denote the z coordinate of the plane. This plane is hereinafter referred to as plane P_(C)(d_(p),O).

Consider an observer O located at (x_(O),z_(O)). Let the width of the display D be ω. A plane is constructed such that for an observer at (x_(O),z_(O)), at whatever depth the plane is placed, its size is such that the plane projects onto each spatial pixel of the display. In other words, what is seen on the plane will take up the entire space of the display's surface.

To calculate the width W for the constructed plane located at depth d_(p), a formula based on the geometry of similar triangles is used. It should be noted that the positive z axis points toward the observer from the display, making d_(p) a negative value in:

$W = {\omega \frac{z_{O} - d_{p}}{z_{O}}}$

Analysis has been conducted using a 1D display, for simplicity. For 1D analysis, a display will be defined as

D=(M _(x) ,N _(u) ,f,D _(LP))

with light field projectors which are addressable as LFP_(i)(u) for suitable u in the defined range. The observer will be defined as O=(X_(O)), with X_(O) comprising an x and a z coordinate only. Let X_(i)^(C) denote the center of LFP_(i). The set of canonical rays of observer O relative to display D is

$\left\{ \overset{\_}{X_{O}X_{i}^{C}} \mid 1 \leq i \leq M_{x} \right\}$

The canonical image produced is I_(C)[D,O](x). A canonical ray

$\overset{\_}{X_{O}X_{i}^{C}}$

has an angular representation θ_(i). Each of the N_(u) elements of LFP_(i)(u) has an angular representation θ(u) and a spatial vector representation, which is denoted as P_(u). For nearest neighbor interpolation:

I _(C)[D,O](i)=LFP _(i)(u _(n) ^(i))

Where:

u _(n) ^(i)=argmin|θ_(i)−θ(u)|
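The 1D nearest neighbor rule above can be sketched in code. The Python below assumes, for illustration only, that the N_u projector directions are uniformly spaced in angle across θ_FOV, that angles are measured from the display normal, that projector centers lie on the line z=0 and that the observer sits at (x_O, z_O) with z_O>0; these modelling choices and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def projector_angle(u: int, n_u: int, theta_fov: float) -> float:
    """theta(u) for u = 1..n_u, assuming uniform angular spacing over the FOV."""
    return -theta_fov / 2.0 + (u - 0.5) * theta_fov / n_u


def nearest_projector_pixel(theta_i: float, n_u: int, theta_fov: float) -> int:
    """u_n = argmin over u of |theta_i - theta(u)| (1-based index)."""
    return min(
        range(1, n_u + 1),
        key=lambda u: abs(theta_i - projector_angle(u, n_u, theta_fov)),
    )


def canonical_image(
    lfp: Sequence[Sequence[float]],
    centers_x: Sequence[float],
    observer: Tuple[float, float],
    theta_fov: float,
) -> List[float]:
    """I_C[D,O](i) = LFP_i(u_n^i) for every light field projector i."""
    x_o, z_o = observer
    image = []
    for pixels, x_i in zip(lfp, centers_x):
        # Angle of the canonical ray joining projector center (x_i, 0) to the observer.
        theta_i = math.atan2(x_o - x_i, z_o)
        u_n = nearest_projector_pixel(theta_i, len(pixels), theta_fov)
        image.append(pixels[u_n - 1])
    return image


if __name__ == "__main__":
    lfp = [[10, 11, 12, 13], [20, 21, 22, 23]]
    print(canonical_image(lfp, [-1.0, 1.0], (0.0, 10.0), 1.0))
```

In this example each projector contributes exactly one of its four directional samples to the two-pixel canonical image, selected by minimum angular distance to the canonical ray.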

Effective Resolution for Inner Frustum

To answer the question of how resolution degrades for a given observer as scene elements move further from the display surface, the analysis is restricted to depths located in the inner frustum of the display. For simplicity, a 1D display is assumed.

To quantify the effective resolution at depth, the key question to answer in this setting is: How do light field projector rays that contribute to reconstructing the incident canonical rays sample the plane P_(C)(d_(p), O)? Modelled here are two sampling steps: (1) light field projector rays sample the plane and (2) incident canonical rays sample the light field from a subset of the light field projector rays. The problem is simplified by assuming that a canonical ray sample is constructed using just one element of a light field projector through nearest neighbor interpolation.

Theorem 1

Assume a display D=(M_(x),N_(u),f,D_(LP)) and an observer O=(X_(O),D_(O),f_(O)). Let z_(O)=ω=M_(x)D_(LP). Then the effective resolution at depth d_(p) may be estimated by:

$P_{x} = \frac{\omega \frac{z_{o} - d_{p}}{z_{o}}}{D_{LP} + {{- d_{p}}\frac{2{\tan \left( {\theta_{FOV}/2} \right)}}{N_{u}}}}$

Proof:

Assume a plane P(d_(p),O). Distances are defined related to how the rays of the light field projectors sample the inner frustum and thus the plane P(d_(p),O). Consider the set of M_(x) canonical rays C={c₁, . . . , c_(M_(x))}, labeled such that c_(i) is the ray that intersects LFP_(i). The intensity associated with the ray c_(i) then is I_(C)[D,O](i).

Based on the defined nearest neighbor scheme, each ray c_(i)∈C is mapped to a corresponding ray in LFP_(i), indexed by u_(n)^(i) as defined previously. There are two possible cases. In the first case, two adjacent canonical rays c_(i) and c_(i+1) both have nearest neighbors in their corresponding light field projectors (LFP_(i),LFP_(i+1)) with the same angle. That is, u_(n)^(i+1)=u_(n)^(i). Another way to look at this is that adjacent canonical rays are mapped to parallel light field projector rays. In the second possible case, two adjacent canonical rays c_(i) and c_(i+1) are mapped to distinct rays in their corresponding light field projectors. That is, u_(n)^(i+1)=u_(n)^(i)+k, for integer k≥1. For displays where N_(u)<M_(x), and assuming that an observer stands at a distance of at least d_(O) from the display surface, this case will be u_(n)^(i+1)=u_(n)^(i)+1.

Distances are now defined based on these two cases. In the first case, two adjacent LFP rays are parallel, with their distance being D_(LP) by definition. In the second case,

$d = D_{LP} + q, \qquad q = {- d_{p}}\frac{2{\tan \left( {\theta_{FOV}/2} \right)}}{N_{u}}$

This combination of parallel samples and diverging samples produces a non-uniform sampling pattern. It is proposed that the effective resolution of a display surface depth-observer trio is the size of the plane P_(C)(d_(p),O) divided by the largest sampling distance. This is because the largest sampling distance determines the smallest feature size that is guaranteed to be sampled.

That is to say that a 2D display with resolution P_(X) will have the same smallest feature size as a particular display surface depth-observer trio. For the inner frustum, the P_(X) will be:

$P_{x} = {\frac{W}{d} = \frac{\omega \frac{z_{o} - d_{p}}{z_{o}}}{D_{LP} + {{- d_{p}}\frac{2{\tan \left( {\theta_{FOV}/2} \right)}}{N_{u}}}}}$

Given the formula for an estimate of effective resolution at depth, it is shown that the formula gives a curve with a minimum value in terms of the variable d_(p). If the plane is at a very large depth (i.e. d_(p)→−∞), the asymptotic minimum effective resolution is:

$B_{x} = {\lim\limits_{d_{p}\rightarrow{- \infty}}P_{x}} = \frac{\omega N_{u}}{z_{o}2{\tan \left( {\theta_{FOV}/2} \right)}}$
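The inner frustum estimate and its asymptote can be checked numerically. The following Python sketch evaluates W, d=D_LP+q, P_x=W/d and B_x as written above, with d_p negative inside the inner frustum; the function names and the sample display parameters are assumptions made only for illustration.

```python
import math


def plane_width(omega: float, z_o: float, d_p: float) -> float:
    """W = omega * (z_o - d_p) / z_o, with d_p <= 0 for the inner frustum."""
    return omega * (z_o - d_p) / z_o


def effective_resolution(
    omega: float, z_o: float, d_p: float, d_lp: float, n_u: int, theta_fov: float
) -> float:
    """P_x = W / (D_LP + q), where q = -d_p * 2 * tan(theta_FOV / 2) / N_u."""
    q = -d_p * 2.0 * math.tan(theta_fov / 2.0) / n_u
    return plane_width(omega, z_o, d_p) / (d_lp + q)


def asymptotic_resolution(omega: float, z_o: float, n_u: int, theta_fov: float) -> float:
    """B_x = omega * N_u / (z_o * 2 * tan(theta_FOV / 2)), the limit as d_p -> -inf."""
    return omega * n_u / (z_o * 2.0 * math.tan(theta_fov / 2.0))


if __name__ == "__main__":
    # Hypothetical display: M_x = 100 projectors of pitch 1, so omega = z_o = 100.
    omega = z_o = 100.0
    for d_p in (0.0, -10.0, -100.0, -1e6):
        print(d_p, effective_resolution(omega, z_o, d_p, 1.0, 32, 1.0))
    print("asymptote", asymptotic_resolution(omega, z_o, 32, 1.0))
```

At the display surface (d_p=0) the estimate reduces to ω/D_LP, that is, the native spatial resolution M_x, and at very large depth it approaches B_x.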

Effective Resolution for Outer Frustum

For the outer frustum, the value of d_(p) will be positive based on the current coordinate system. When d_(p) is positive, q becomes negative and therefore d<D_(LP).

It is proposed that the effective resolution of a display surface depth-observer trio is the size of the plane divided by the largest sampling distance, which is now D_(LP).

Disparity Encoding/Decoding

The encoded layered scene decomposition representation of a light field produced from a sampling scheme applied to each layer is principally comprised of a plurality of pixels including RGB color and disparity. Generally speaking, selecting an appropriate bit width for the disparity (depth) field of the pixel is important, as a wider field improves the accuracy of the operation during reconstruction. However, the use of an increased number of bits contributes negatively to the compression rate achieved.

In the present disclosure, each layer of RGB color and disparity pixels specified by the given sampling scheme has a specific range of disparity corresponding to the individual pixels. The present disclosure exploits this narrow range of disparity within each layered scene decomposition layer to increase the accuracy of the depth information. In conventional pixel representations, the range of disparity for an entire scene is mapped to a fixed number of values. For example, in 10-bit disparity encoding, there can only be 1024 distinct depth values. In the layered scene decomposition of the present disclosure, the same fixed number of values is applied to each layered scene decomposition layer, as each layer has known depth boundaries. This is advantageous as the transmission bandwidth can be reduced by decreasing the width of the depth channel, while maintaining pixel reconstruction accuracy. For example, when the system implements a disparity width of 8 bits and the scene is decomposed into 8 layered scene decomposition layers, a total of 2048 distinct disparity values can be used, with each layer having 256 distinct possible values based on the 8-bit representation. This is more efficient than mapping the entire range of possible disparity values within the inner or outer frustum to a given number of bits.

The present disclosure utilizes the same number of bits, but the bits are interpreted separately and distinctly represent disparity within each layered scene decomposition layer. Since each layered scene decomposition layer is independent of the others, depth (bit) encoding can differ for each layer and can be designed to provide a more accurate fixed-point representation. For example, a layered scene decomposition layer closer to the display surface has smaller depth values and can use a fixed-point format with a small number of integer bits and a large number of fractional bits, while layered scene decomposition layers further away from the display surface have larger depth values and can use a fixed-point format with a large number of integer bits and a small number of fractional bits. The fractional bits are configurable on a per layer basis:

MinFixedPoint=1/(2^(FractionalBits))

MaxFixedPoint=2^(16-FractionalBits)−MinFixedPoint

Disparity is calculated from the depth in the light field post-processing stage and encoded using the following formula:

ScaleFactor=(MaxFixedPoint−MinFixedPoint)/(NearClipDisparity−FarClipDisparity)

EncodedDisparity=(Disparity−FarClipDisparity)*ScaleFactor+MinFixedPoint

Disparity is decoded using the following formula:

ScaleFactor=(MaxFixedPoint−MinFixedPoint)/(NearClipDisparity−FarClipDisparity)

UnencodedDisparity=(EncodedDisparity−MinFixedPoint)/ScaleFactor+FarClipDisparity
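The encoding and decoding formulas above can be exercised with a short round-trip sketch. The Python below transcribes MinFixedPoint, MaxFixedPoint, ScaleFactor, EncodedDisparity and UnencodedDisparity for a 16-bit fixed-point word; it deliberately omits quantization to an integer storage format, which the formulas do not specify, and the function names and sample clip disparities are assumptions.

```python
from typing import Tuple


def fixed_point_range(fractional_bits: int, total_bits: int = 16) -> Tuple[float, float]:
    """Return (MinFixedPoint, MaxFixedPoint) for the given fractional bit count."""
    min_fp = 1.0 / (2 ** fractional_bits)
    max_fp = 2 ** (total_bits - fractional_bits) - min_fp
    return min_fp, max_fp


def scale_factor(near_clip: float, far_clip: float, fractional_bits: int) -> float:
    """ScaleFactor = (MaxFixedPoint - MinFixedPoint) / (NearClip - FarClip)."""
    min_fp, max_fp = fixed_point_range(fractional_bits)
    return (max_fp - min_fp) / (near_clip - far_clip)


def encode_disparity(disparity: float, near_clip: float, far_clip: float, fractional_bits: int) -> float:
    """EncodedDisparity = (Disparity - FarClip) * ScaleFactor + MinFixedPoint."""
    min_fp, _ = fixed_point_range(fractional_bits)
    return (disparity - far_clip) * scale_factor(near_clip, far_clip, fractional_bits) + min_fp


def decode_disparity(encoded: float, near_clip: float, far_clip: float, fractional_bits: int) -> float:
    """UnencodedDisparity = (Encoded - MinFixedPoint) / ScaleFactor + FarClip."""
    min_fp, _ = fixed_point_range(fractional_bits)
    return (encoded - min_fp) / scale_factor(near_clip, far_clip, fractional_bits) + far_clip


if __name__ == "__main__":
    # Hypothetical layer whose disparities span far clip 2.0 to near clip 4.0.
    encoded = encode_disparity(3.0, near_clip=4.0, far_clip=2.0, fractional_bits=8)
    print(encoded, decode_disparity(encoded, 4.0, 2.0, 8))
```

With 8 fractional bits, MinFixedPoint is 1/256 and MaxFixedPoint is 256−1/256, so the far clip disparity maps to the smallest representable value and the near clip disparity to the largest. Because the clip disparities are per layer, the full fixed-point range is reused within each layer.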

Generalized and Illustrative Embodiment—CODEC Implementation and Applications Overview

The present disclosure defines an encoder-decoder for various types of angular pixel parameterizations, such as, but not limited to, planar parameterizations, arbitrary display parameterizations, a combination of parameterizations, or any other configuration or parameterization type. A generalized and illustrative embodiment of the present disclosure provides a method to generate a synthetic light field for multi-dimensional video streaming, multi-dimensional interactive gaming, or other light field display scenarios. A rendering system and processes are provided that can drive a light field display with real-time interactive content. The light field display does not require long-term storage of light fields; however, the light fields must be rendered and transmitted at low latency to support an interactive user experience.

FIG. 7 provides a CODEC system overview of the generalized, illustrative embodiment of the present invention. A gaming engine or interactive graphics computer (70) transmits three-dimensional scene data to GPU (71). The GPU encodes the data and sends it over the display port (72) to a decoding unit (73) containing a decoding processor such as an FPGA or ASIC. The decoding unit (73) sends decoded data to a light field display (74).

FIG. 1 illustrates another generalized, exemplary layered scene decomposition CODEC system, where light field data from a synthetic or video data source (50) is input to encoder (51). A GPU (43) encodes the inner frustum volume data, dividing it into a plurality of layers, and GPU (53) encodes the outer frustum volume data, dividing it into an additional plurality of layers. While FIG. 1 illustrates separate GPUs (43, 53) dedicated for the inner and outer frustum volume layers, a single GPU can be utilized to process both the inner and outer frustum volume layers. Each of the layered scene decomposition layers is transmitted to decoder (52), where the plurality of inner frustum volume layers (44(1) through 44(*)) and the plurality of outer frustum volume layers (54(1) through 54(*)) of a light field are decoded and merged into a single inner frustum volume layer (45) and a single outer frustum volume layer (55). As per double frustum rendering, the inner and outer frustum volumes are then synthesized (merged) into a single, reconstructed set of light field data (56), otherwise referred to herein as a “final light field” or “display light field”.

FIGS. 10 to 13 illustrate exemplary CODEC process implementations according to the present disclosure.

FIG. 10 illustrates an exemplary layered scene decomposition CODEC method, whereby 3D scene data in the format of image description or light field data is loaded to an encoder (400) for encoding, whereupon data (sub)sets as illustrated in the figure, or alternatively the entire data set representing the 3D scene, are partitioned (403). In the case of the identification of 3D scene data subsets for partitioning (402), it is understood that the identification process is a general process step reference which is intended to simply refer to the ability to partition the data set in one pass, or in groupings (e.g. to encode inner frustum and outer frustum data layers as illustrated in more detail in FIG. 11), as may be desired according to the circumstances. In this regard, the identification of data subsets may imply pre-encoding processing steps or processing steps also forming part of the encoding sub-process stage (401). Data subsets may be tagged, specified, confirmed, scanned and even compiled or grouped at the time of partitioning to produce a set of layers (decomposition of the 3D scene) (403). Following the partitioning of data subsets (403), each data layer is sampled and rendered according to the present disclosure to produce compressed (image) data (404). Following data layer compression, the compressed data is transmitted to a decoder (405) for the decoding sub-process (406) comprising decompression, decoding and re-composition steps to (re)construct a set of light fields (407), otherwise referred to herein as “layered light fields”, layered light field images and light field layers. The constructed layered light fields are merged to produce the final light field (408) displaying the 3D scene (409).

An exemplary, parallel CODEC process is illustrated in FIG. 13 for optimizing the delivery of a light field representing a 3D scene in real-time (e.g. to minimize artifacts). The process comprises the steps of loading 3D scene data to an encoder (700), encoding and compressing the residue encoded representation (701) of the final light field, transmitting the residue encoded representation (702) to a decoder, decoding the residue encoded representation and using the residue encoded representation with the core encoded representation to produce the final light field (703), and displaying the 3D scene at a display (704).

FIG. 11 illustrates an embodiment related to the embodiment shown in FIG. 10 in that two data (sub)sets, the inner frustum layer (502) and the outer frustum layer (503), which are derived based on the 3D scene data (500), are identified for partitioning (501), and the partitioning of each data set into layers of differential depths is implemented according to two different layering schemes for each data set (504, 505), i.e. equivalent to a plurality of data layers. Each set (plurality) of data layers (506, 507), representing an inner frustum and outer frustum volume of a light field display respectively, is subsequently sampled on a per layer basis according to a sampling scheme (508, 509); and each sampled layer is rendered to compress the data and produce two sets of compressed (image) data (510, 511) in process steps (508, 509), respectively. The sets of compressed data (510, 511), encoding the sets of light fields corresponding to the sets of data layers (506, 507), are then combined (512) to produce a layered, core encoded representation (513) (CER) of a final (display) light field.

FIG. 12 illustrates an embodiment of a CODEC method or process to reconstruct a set of light fields and produce a final light field at a display. The set of light fields (layered light fields) is (re)constructed from the core encoded representation (513) using multi-stage view synthesis protocols (600). A protocol (designated as VS1-VS8) is applied (601-608) to each of the eight layers of the core encoded representation (513), which protocols may or may not be different depending on characteristics of each data layer light field to be decoded. Each protocol may apply a form of non-linear interpolation, termed herein edge adaptive interpolation (609), to provide good image resolution and sharpness in the set(s) of layered light fields (610) reconstructed from the core encoded representation of said fields. The layered light fields (610) are merged, in this case illustrating the merging of two sets of light fields (611, 612) corresponding to two data subsets to produce two sets of merged light fields (613, 614). The merged sets of light fields (613, 614) may represent, for example, the inner frustum and outer frustum volumes of a final light field and can be accordingly merged (615) to produce said final light field (616) at a display.

CODEC Encoder/Encoding

Encoding according to the present disclosure is designed to support the generation of real-time interactive content (for example, for gaming or simulation environments) as well as existing multi-dimensional datasets captured through light field generalized pinhole cameras or camera arrays.

For a light field display D, a layered scene decomposition L, and a sampling scheme S, the system encoder produces the elemental images associated with the light fields corresponding to each layered scene decomposition layer included in the sampling scheme. Each elemental image corresponds to a generalized pinhole camera. The elemental images are sampled at the resolution specified by the sampling scheme and each elemental image includes a depth map.

Achieving the rendering performance needed to drive real-time interactive content to a multi-dimensional display of significantly high resolution and size presents significant challenges. These are overcome by applying a hybrid or combination rendering approach, which resolves the deficiencies of relying solely on any one technique as described herein.

When given identity function α, the set of generalized pinhole cameras specified by the encoding scheme for a given layered scene decomposition layer can be systematically rendered using standard graphics viewport rendering. This rendering method results in a high number of draw calls, particularly for layered scene decomposition layers with sampling schemes including large numbers of the underlying elemental images. Therefore, in a system utilizing layered scene decomposition for realistic, autostereoscopic light field displays, this rendering method alone does not provide real-time performance.

A rendering technique utilizing standard graphics draw calls restricts the rendering of a generalized pinhole camera's planar parameterizations (identity function α) to perspective transformations. Hardware-optimized rasterization functions provide the performance required for high-quality real-time rendering in traditional two-dimensional displays. These accelerated hardware functions are based on planar parameterizations. Alternatively, parallel oblique projections can utilize standard rasterized graphics pipelines to render generalized pinhole camera planar parameterizations.

The present disclosure contemplates the application of rasterization to render the generalized pinhole camera views by converting sets of triangles into pixels on the display surface. When rendering large numbers of views, every triangle must be rasterized in every view; oblique rendering reduces the number of rendering passes required for each layered scene decomposition layer and can accommodate any arbitrary identity function α. The system utilizes one parallel oblique projection per angle specified by the identity function α. Once the data is rendered, the system executes a “slice and dice” block transform (see U.S. Pat. Nos. 6,549,308 and 7,436,537) to re-group the stored data from its by-angle grouping into an elemental image grouping. The “slice and dice” method alone is inefficient for real-time interactive content, as it requires many separate oblique rendering draw calls when a large number of angles is to be rendered.

An arbitrary identity function α can also be accommodated by a ray-tracing rendering system. In ray tracing, specifying arbitrary angles does not require higher performance than accepting planar parameterizations. However, for real-time interactive content requiring rendering systems utilizing the latest accelerated GPUs, rasterization provides more reliable performance scalability than ray tracing rendering systems.

The present disclosure provides several hybrid rendering approaches to efficiently encode a light field. In one embodiment, encoding schemes render layered scene decomposition layers located closer to the display surface with more images requiring fewer angular samples, and layers located further away from the display surface with fewer images and more angular samples. In a related embodiment, perspective rendering, oblique rendering, and ray tracing are combined to render layered scene decomposition layers; these rendering techniques can be implemented in a variety of interleaved rendering methods.

According to the generalized, illustrative embodiment of the disclosure, one or more light fields are encoded by a GPU rendering an array of two-dimensional pinhole cameras. The rendered representation is created by computing the pixels from the sampling scheme applied to each of the layered scene decomposition layers. A pixel shader performs the encoding algorithm. Typical GPUs are optimized to produce a maximum of 2 to 4 pinhole camera views per scene in one transmission frame. The present disclosure requires rendering hundreds or thousands of pinhole camera views simultaneously, thus multiple rendering techniques are employed to render data more efficiently.

In one optimized approach, the generalized pinhole cameras in the layered scene decomposition layers located further away from the display surface are rendered using standard graphics pipeline viewport operations, known as perspective rendering. The generalized pinhole cameras in the layered scene decomposition layers located closer to the display surface are rendered using the “slice and dice” block transform. Combining these methods provides high efficiency rendering for layered plenoptic sampling theory sampling schemes. The present disclosure provides layered scene decomposition layers wherein layers located further away from the display surface contain a smaller number of elemental images with a higher resolution and layers located closer to the display surface contain a greater number of elemental images with a lower resolution. Rendering the smaller number of elemental images in the layers further away from the display surface with perspective rendering is efficient, as the method requires only a single draw call for each elemental image. However, perspective rendering becomes inefficient for layers located closer to the display surface, as these layers contain a greater number of elemental images, requiring an increased number of draw calls. Since elemental images located in layers located closer to the display surface correspond to a relatively small number of angles, oblique rendering can efficiently render these elemental images with a reduced number of draw calls. In one embodiment, a process is provided to determine where the system should utilize perspective rendering, oblique rendering, or ray tracing to render the layered scene decomposition layers. Applying a threshold algorithm, each layered scene decomposition layer is evaluated to compare the number of elemental images to be rendered (i.e., the number of perspective rendering draw calls) to the size of the elemental images required at the particular layer depth (i.e., the number of oblique rendering draw calls), and the system implements the rendering method (technique) requiring the least number of rendering draw calls.
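The threshold comparison described above can be sketched as follows. This is a minimal illustration only, assuming the two draw-call counts are already known for each layer; the names `LayerCost` and `choose_renderer`, the example counts, and the tie-breaking rule in favour of perspective rendering are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    """Hypothetical per-layer draw-call estimates (illustrative only)."""
    perspective_draw_calls: int  # one call per elemental image to be rendered
    oblique_draw_calls: int      # one call per angle, tied to elemental image size


def choose_renderer(cost: LayerCost) -> str:
    """Return the technique that needs the fewest draw calls for this layer.

    Ties go to perspective rendering; that choice is an assumption.
    """
    if cost.perspective_draw_calls <= cost.oblique_draw_calls:
        return "perspective"
    return "oblique"


# Example: a near layer (many small elemental images) and a far layer
# (few large elemental images). The numbers are made up for illustration.
layers = [LayerCost(4096, 64), LayerCost(16, 1024)]
choices = [choose_renderer(c) for c in layers]
```

In this sketch the near layer selects oblique rendering and the far layer selects perspective rendering, which matches the qualitative behaviour described in the preceding paragraph. Ray tracing is left out because the text applies it only where standard graphics calls cannot be used.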

Where standard graphics calls cannot be utilized, the system can implement ray tracing instead of perspective or oblique rendering. Accordingly, in another embodiment, an alternative rendering method renders layers located closer to the display surface, or a portion of the layers located closer to the display surface, using ray tracing.

In ray-tracing rendering systems, each pixel in a layered scene decomposition layer is associated with a light ray defined by the light field. Each ray is cast and the intersection with the layered scene decomposition is computed as per standard ray tracing methodologies. Ray tracing is advantageous when rendering an identity function α which does not adhere to the standard planar parameterizations expected by the standard GPU rendering pipeline, as ray tracing can accommodate the arbitrary ray angles that are challenging for traditional GPU rendering.

When a hogel projects pixels into space, not every pixel will be useful. Consider the top left hogel of a display projecting a pixel up and to the left. The only time this pixel would be seen by an observer would be if the observer was in a location where the top left hogel is at the bottom right boundary of the observer's field of view. From this location, every other hogel in the display would be viewed from a larger angle than the field of view allows for and as a result every other hogel would be turned off except the top left hogel. This observer location is not a useful viewing location, and if the top left pixel of the top left hogel were turned off it would be inconsequential. This discussion uses the concept of a valid viewing zone. The valid viewing zone is the set of all locations in space where an observer can view every hogel on the display at an angle within the field of view and as a result receives a pixel from each hogel. This zone is where the projection frustums of all hogels intersect.

The definition of the valid viewing zone can be effectively slimmed down to where the projection frustums of the four corner hogels intersect. The corners are the most extreme cases, so if a location is within the projection frustum of the four corners the location is also within the valid viewing zone. This approach also introduces the concept of a maximum viewing distance, which is the constraint introduced in order to realize these savings and efficiencies. Without the maximum viewing distance, the viewing frustum is a rectangular pyramid whose tip is oriented along the negative display normal and whose base is at an infinite depth from the display (i.e. a standard frustum). With a maximum viewing distance introduced, the base of the rectangular pyramid lies at a distance equal to the maximum viewing distance. The approach taken to realize savings is not rendering or sending pixels that will not be projected into the valid viewing zone and are therefore wasteful. The number of pixels that are needed for a specified maximum viewing distance is the hogel fill factor. The hogel fill factor is the ratio between the viewing zone size and the hogel projection size at a given depth (i.e. in 2D, if the hogel projection has a width of 1 m and the viewing zone has a width of 0.5 m, then only half the projected pixels are needed).

DW represents the display width in meters, MVD is the minimum view distance (in meters), and FOV is the field of view (in degrees). The maximum viewing distance is defined as MVD+y, where y represents the size of the usable range in meters. From similar geometry, angle b is equal to angle a, where angle b is equal to the field of view (in degrees). The width of the viewing zone, labeled c, is defined by the equation:

$c = \tan\left( \frac{FOV}{2} \right) * 2 * y$

The width of the hogel projection is defined by the equation:

$e = \tan\left( \frac{FOV}{2} \right) * 2 * \left( y + MVD \right)$

The hogel fill factor in 2D is the ratio between c and e, therefore:

$\text{hogel fill factor} = \frac{\tan\left( \frac{FOV}{2} \right) * 2 * y}{\tan\left( \frac{FOV}{2} \right) * 2 * \left( y + MVD \right)}$

Which reduces to:

$\text{hogel fill factor} = \frac{y}{y + MVD}$

If this is applied in 3D, then the hogel fill factor is applied along both (x,y). As a result, the hogel fill factor is defined as:

$\text{hogel fill factor} = \left( \frac{y}{y + MVD} \right)^{2}$

The result of increasing or decreasing the hogel fill factor is an increase or decrease in maximum viewing depth respectively.
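As a worked illustration of the equations above, the following sketch evaluates the viewing zone width c, the hogel projection width e, and the resulting hogel fill factor. The function names and the example values (a 90 degree field of view, MVD of 1 m and usable range y of 1 m) are assumptions chosen for illustration and do not come from the disclosure.

```python
import math


def viewing_zone_width(fov_deg: float, y: float) -> float:
    """c = tan(FOV/2) * 2 * y, with FOV given in degrees."""
    return math.tan(math.radians(fov_deg) / 2) * 2 * y


def hogel_projection_width(fov_deg: float, y: float, mvd: float) -> float:
    """e = tan(FOV/2) * 2 * (y + MVD), with FOV given in degrees."""
    return math.tan(math.radians(fov_deg) / 2) * 2 * (y + mvd)


def hogel_fill_factor(y: float, mvd: float, three_d: bool = False) -> float:
    """y / (y + MVD) in 2D, squared when applied along both axes in 3D."""
    ratio = y / (y + mvd)
    return ratio ** 2 if three_d else ratio


# Illustrative values only.
fov, mvd, usable_range = 90.0, 1.0, 1.0
c = viewing_zone_width(fov, usable_range)
e = hogel_projection_width(fov, usable_range, mvd)
fill_2d = hogel_fill_factor(usable_range, mvd)
fill_3d = hogel_fill_factor(usable_range, mvd, three_d=True)
```

With these values the ratio c/e equals the 2D fill factor of one half, so in 3D only one quarter of the projected pixels would fall inside the valid viewing zone. The field of view cancels out of the ratio, as the reduced form of the equation shows.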

Ray Trace Pixels in Corrected Sample Pattern

A strategy to correct a light field is to rasterize the light field and then apply a per-pixel warp operation. Where the pixel is supposed to go is determined by a characterization routine that involves imaging the display. How a light field is warped depends on an equation whose form does not change but whose coefficients do change. These coefficients are unique to each display. The idea behind the correction (but not literally how it works) is that if a pixel was supposed to be at X but instead was measured at X+0.1, the pixel would be warped to location X−0.1 in anticipation that it will be measured at X. The goal is to have the measured location match the intended location.

This strategy of generation on a uniform grid followed by warping to the correct grid could be replaced with sampling the correct grid immediately using ray tracing. Rasterization is a uniform grid operation while ray tracing is generalized sampling. This would also help maintain the light field integrity. Consider a white pixel in a sea of black. The correction for a first lens system calls for both a horizontal and vertical shift of +0.5. The result is 4 gray pixels in a 2×2 grid surrounding 0.5,0.5. The display lens calls for both a horizontal and vertical shift of −0.5. The result is a 3×3 grid of illuminated pixels: a bright gray pixel in the center, four dimmer gray pixels on the four sides, and four even dimmer gray pixels on the four corners. This dispersion of energy would not happen if the pixels were originally sampled correctly. It seems unlikely that ray tracing will be faster than rasterization, but the entire pipeline might be quicker if correction was cut out and only half the light field was captured as per the calculated hogel fill factor.
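The energy dispersion in the white-pixel example can be reproduced with a short sketch. It assumes that each correction is applied as a bilinear resample of the image, which the text does not state explicitly; the function name `shift_bilinear` and the 5×5 test image are hypothetical.

```python
import math


def shift_bilinear(img, dx, dy):
    """Resample img so its content moves by (dx, dy) pixels.

    Bilinear weights are assumed; samples outside the image count as black.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy            # source position to sample
            x0, y0 = math.floor(sx), math.floor(sy)
            fx, fy = sx - x0, sy - y0          # fractional offsets in [0, 1)
            for yy, wy in ((y0, 1 - fy), (y0 + 1, fy)):
                for xx, wx in ((x0, 1 - fx), (x0 + 1, fx)):
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += wy * wx * img[yy][xx]
    return out


# A single white pixel in a sea of black.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0

first = shift_bilinear(image, 0.5, 0.5)     # first lens correction
second = shift_bilinear(first, -0.5, -0.5)  # display lens correction
lit_after_first = sum(1 for row in first for v in row if v > 0)
lit_after_second = sum(1 for row in second for v in row if v > 0)
```

Under this assumption the first shift spreads the pixel over a 2×2 block and the second spreads it over a 3×3 block, with the centre brightest, the sides dimmer and the corners dimmest. The total energy is preserved but no longer sits in one pixel, which is the loss of integrity that direct sampling of the corrected grid would avoid.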

Screen Space Ray Tracing

An alternative to the warping approach to view synthesis is screen space ray tracing. McGuire et al. propose application of screen space ray tracing to multiple depth layers (for robustness). These depth layers are those that are produced by depth peeling. However, the depth peeling algorithm is slow, therefore when using modern GPUs, single-pass methods are preferred, e.g., Mara et al., based around reverse reprojection, multiple viewport, and multiple rasterization.

There is potential to combine screen space ray tracing with layered scene decomposition. Individual rays are traced based on the views that are known. From this, the result is an image indicating the color at each pixel. For a layered scene decomposition CODEC process, an encoded form of a light field is created and represented as layers with missing pixels. These pixels can be reconstructed using screen space ray tracing from the pixels present in the encoded representation. This representation can be elemental images or layered elemental images in the form of a deep G-buffer, for example. McGuire et al. describe one technique for doing this, using an acceleration data structure for the layered depth image type representation. This stands in contrast to methods that trace rays at polygonal or object level representations, which also are used effectively with data structures that accelerate ray intersection.

Many real-time rendering techniques operate in screen-space in order to be computationally efficient, including techniques for approximating realistic lighting, such as, but not limited to, screen-space ambient occlusion, soft shadows, and camera effects such as depth of field. These screen-space techniques approximate algorithms that traditionally work by ray tracing 3D geometry. Many of these algorithms use screen-space ray tracing, or rather ray marching as described by Sousa et al. Ray marching is desirable as no additional data structure needs to be built. Classic ray marching methods, like Digital Differential Analyzer (DDA), are susceptible to over- and under-sampling, unless perspective is accounted for. Most screen-space ray tracing methods use only a single depth layer. Combination of this technique with the layered scene decomposition allows the algorithms to work on a subset of the scene, rather than through multiple depth layers, and due to the optimized partitioning of the scene into subsets, ray hit distance may be reduced.

The skilled technician in the field to which the invention pertains will appreciate that there are multiple rendering methods and combinations of rendering methods that can successfully encode the layered scene decomposition elemental images. Other rendering methods may provide efficiency in different contexts, dependent upon the system's underlying computational architecture, the utilized sampling scheme, and the identity function α of the light field display.

CODEC Decoder/Decoding

Decoding according to the present disclosure is designed to exploit the encoding strategy (sampling and rendering). The core representation as a set of layered light fields from a downsampled layered scene decomposition is decoded to reconstruct the light fields LF^(O) and LF^(P). Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with a layered scene decomposition L=(K₁,K₂,L^(O),L^(P)) and an associated sampling scheme S=(M_(s),R). The elemental images are decoded by reconstructing the light fields LF^(O) and LF^(P) from deconstructed LF^(O) and LF^(P) light fields downsampled as specified by sampling scheme S. The pixels align such that the inner and outer frustum volume layers located closer to the display surface are reviewed first, moving to inner and outer frustum volume layers located further away from the display surface until a non-empty pixel is located, and the data from the non-empty pixel is transmitted to the empty pixel closer to the display surface. In an alternative embodiment, particular implementations may restrict viewing to the inner frustum volume or the outer frustum volume of the light field display, thereby requiring the decoding of one of LF^(O) or LF^(P).

In one embodiment, a decoding process is represented by the following pseudocode:

Core Layered Decoding:

for each l_(i) ϵ L^(O):

    ReconLF(LF_(l_i), D_(m)[LF_(l_i)], S)

    LF^(O) = LF_(l_i) *_(m) LF_(l_(i−1)) // or LF_(l_i) *_(m) LF_(l_(i+1)) (front-back vs. back-front)

A similar procedure reconstructs LF^(P). Each layered scene decomposition layer is reconstructed from the limited samples defined by the given sampling scheme S. The inner frustum volume layers or the outer frustum volume layers are then merged to reproduce LF^(O) or LF^(P).
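The core layered decoding loop can be sketched in runnable form as follows. This is only an illustration under stated assumptions: an empty pixel is represented by `None`, light fields are flattened to one-dimensional lists, the merge operator keeps the nearer layer's pixel unless it is empty, and `recon_lf` is a placeholder that returns its input unchanged rather than performing real view synthesis. None of these names or representations is taken from the disclosure.

```python
from typing import List, Optional

Pixel = Optional[float]     # None marks an empty pixel (assumption)
LightField = List[Pixel]    # flattened to one dimension for brevity


def recon_lf(sampled: LightField) -> LightField:
    """Placeholder for ReconLF; a real decoder would synthesize missing views."""
    return list(sampled)


def merge(front: LightField, back: LightField) -> LightField:
    """Assumed merge operator: keep the nearer pixel unless it is empty."""
    return [f if f is not None else b for f, b in zip(front, back)]


def decode_layers(layers: List[LightField]) -> LightField:
    """Front-to-back merge; layers are ordered nearest to farthest."""
    merged = recon_lf(layers[0])
    for layer in layers[1:]:
        merged = merge(merged, recon_lf(layer))
    return merged


# Three layers, nearest first; farther layers fill pixels left empty in front.
layers = [
    [1.0, None, None],
    [5.0, 2.0, None],
    [9.0, 9.0, 3.0],
]
result = decode_layers(layers)
```

Each output pixel takes its value from the nearest layer in which that pixel is non-empty, which mirrors the front-to-back review of layers described for the decoder. A back-to-front variant would iterate over the layers in reverse and let nearer layers overwrite farther ones.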

ReconLF can be executed in various forms with varying computational and post-CODEC image quality properties. ReconLF may be defined as a function such that, given a light field associated with a layer that has been sampled according to given sampling scheme S, and the corresponding depth map for the light field, it reconstructs the full light field that has been sampled. The ReconLF input is the subset of LF_(l_i) data defined by the given sampling scheme S and the corresponding downsampled depth map D_(m)[LF_(l_i)]. Depth-Image Based Rendering (DIBR), as described by Graziosi et al., can reconstruct the input light field. DIBR can be classified as a projection rendering method. In contrast to re-projection techniques, ray-casting methods, such as the screen space ray casting taught by Widmer et al., can reconstruct the light fields. Ray casting enables greater flexibility than re-projection but increases computational resource requirements.

In the DIBR approach, elemental images specified in the sampling scheme S are used as reference “views” to synthesize the missing elemental images from the light field. As described by Vincent Jantet in “Layered Depth Images for Multi-View Coding” and by Graziosi et al., when the system uses DIBR reconstruction, the process typically includes forward warping, merging, and back projection.

Application of the back-projection technique avoids producing cracks and sampling artifacts in synthesized views such as elemental images. Back-projection assumes that the elemental image's depth map or disparity map is synthesized along with the necessary reference images required to reconstruct the target image; such synthesis usually occurs through a forward warping process. With the disparity value for each pixel in the target image, the system warps the pixel to a corresponding location in a reference image; typically, this reference image location is not aligned on the integer pixel grid, so a value from the neighboring pixel values must be interpolated. Implementations of back projection known in the art use simple linear interpolation. Linear interpolation, however, can be problematic. If the warped reference image location sits on or near an object edge boundary, the interpolated value can exhibit significant artifacts, as information from across the edge boundary is included in the interpolation operation. The synthesized image is generated with a “smeared” or blurred edge.

The present disclosure provides a back-projection technique for the interpolation substep, producing a high-quality synthesized image without smeared or blurred edges. The present disclosure introduces edge-adaptive interpolation (EAI), where the system incorporates depth map information to identify the pixels required by the interpolation operation to calculate the colour of the warped pixels in a reference image. EAI is a nonlinear interpolation procedure that adapts and preserves edges during low-pass filtering operations. Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with a target image I_(t)(x,y), a reference image I_(r)(x,y), and depth maps D_(m)(I_(t)) and D_(m)(I_(r)). The present disclosure utilizes the depth map D_(m)(I_(t)), the pinhole camera parameters (f, α, etc.) and the relative position of the display's array of planar-parameterized pinhole projectors to warp each integer I_(t) pixel position (x,y) to a real-number position (x_(w),y_(w)) in I_(r). In the likely scenario where (x_(w),y_(w)) is not located on an integer coordinate position, a value must be reconstructed based on I_(r) integer samples.

Linear interpolation methods known in the art reconstruct I_(r)(x_(w),y_(w)) from the four nearest integer coordinates located in a 2×2 pixel neighborhood. Alternate reconstruction methods use larger neighborhoods (such as 3×3 pixel neighborhoods), generating similar results with varying reconstruction quality (see Marschner et al., “An evaluation of reconstruction filters for volume rendering”). These linear interpolation methods have no knowledge of the underlying geometry of the signal. The smeared or blurred edge images occur when the reconstruction utilizes pixel neighbors belonging to different objects, separated by an edge in the images. The erroneous inclusion of colour from other objects creates ghosting artifacts. The present disclosure remedies this reconstruction issue by providing a method to weigh or omit pixel neighbors by using the depth map D_(m)(I_(r)) to predict the existence of edges created when a plurality of objects overlap.

FIG. 3A illustrates textures (80,83), where a sampling location, illustrated as a black dot (86), is back-projected into another image being reconstructed. The sampling location (86) is located near the boundary of a dark object (87) with a white background (88). In a first reconstruction matrix (81), the full 2×2 pixel neighborhood, each single white pixel represented by a square (89), reconstructs the sampling location (86) value using a known technique such as linear interpolation. This results in a non-white pixel (82), as the dark object (87) is included in the reconstruction. The second reconstruction matrix (84) uses the EAI technique of the present disclosure, reconstructing the sampling location (86) from the three neighboring single white pixels (90). EAI detects the object edge and omits the dark pixel (87), resulting in the correct white pixel reconstruction (85).

For a fixed, arbitrary coordinate (x_(r),y_(r)) in the target image I_(t)(x,y), d_(t) defines the location depth:

$d_{t} = D_{m}\left\lbrack I_{r} \right\rbrack\left( x_{r},y_{r} \right)$

The target image coordinate (x_(r),y_(r)) warps to the reference image coordinate (x_(w),y_(w)).

An m-sized neighborhood of points close to (x_(w),y_(w)) forms the set N_(S)={(x_(i),y_(i))|1≤i≤m}. The weight for each of the neighbors is defined as:

$w_{i} = f\left( d_{t},\ D_{m}\left\lbrack I_{r} \right\rbrack\left( x_{i},y_{i} \right) \right)$

Where w_(i) is a function of the depth at (x_(r),y_(r)) and the depth of the neighbor of (x_(w),y_(w)) corresponding to index i. The following equation represents an effective w_(i) for a given threshold t_(e):

$w_{i} = \begin{cases} 1, & \left| d_{t} - D_{m}\left\lbrack I_{r} \right\rbrack\left( x_{i},y_{i} \right) \right| < t_{e} \\ 0, & \left| d_{t} - D_{m}\left\lbrack I_{r} \right\rbrack\left( x_{i},y_{i} \right) \right| \geq t_{e} \end{cases}$

The threshold t_(e) is a feature size parameter. The weight function determines how to reconstruct I_(r)(x_(r),y_(r)):

$I_{r}\left( x_{r},y_{r} \right) = \mathrm{Recon}\left( w_{1}I_{r}\left( x_{1},y_{1} \right),\ w_{2}I_{r}\left( x_{2},y_{2} \right),\ \ldots,\ w_{m}I_{r}\left( x_{m},y_{m} \right) \right)$

The Recon function can be a simple modified linear interpolation, where the w_(i) weights are incorporated with standard weighting procedures and re-normalized to maintain a total weight of 1.
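A minimal sketch of edge-adaptive interpolation over a 2×2 neighborhood follows. It assumes that the standard weighting is bilinear, that d_(t) is supplied by the caller as the depth of the target pixel being warped, and that `None` is returned when every neighbor is rejected; the function name `eai_sample` and the example values are hypothetical. The example mirrors the FIG. 3A situation of one dark foreground pixel beside three white background pixels.

```python
import math


def eai_sample(color, depth, xw, yw, d_t, t_e):
    """Reconstruct a value at the real-valued position (xw, yw).

    Each 2x2 neighbor gets its bilinear weight multiplied by w_i, where
    w_i is 1 if |d_t - depth| < t_e and 0 otherwise. The weights are then
    re-normalized to sum to 1. Returns None if all neighbors are rejected
    (an assumed fallback, not specified in the text).
    """
    x0, y0 = math.floor(xw), math.floor(yw)
    fx, fy = xw - x0, yw - y0
    total_weight, total = 0.0, 0.0
    for yi, by in ((y0, 1 - fy), (y0 + 1, fy)):
        for xi, bx in ((x0, 1 - fx), (x0 + 1, fx)):
            if not (0 <= yi < len(color) and 0 <= xi < len(color[0])):
                continue
            w_i = 1.0 if abs(d_t - depth[yi][xi]) < t_e else 0.0
            weight = w_i * bx * by
            total_weight += weight
            total += weight * color[yi][xi]
    return total / total_weight if total_weight > 0 else None


# White background (1.0) at depth 10, one dark pixel (0.0) at depth 2.
color = [[1.0, 1.0], [1.0, 0.0]]
depth = [[10.0, 10.0], [10.0, 2.0]]

plain = eai_sample(color, depth, 0.5, 0.5, d_t=10.0, t_e=math.inf)  # no rejection
edge_aware = eai_sample(color, depth, 0.5, 0.5, d_t=10.0, t_e=1.0)
```

With an infinite threshold the sketch behaves like ordinary bilinear interpolation and the dark pixel bleeds into the result. With a finite threshold the dark pixel is omitted and the three remaining weights are re-normalized, so the reconstructed value stays white.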

The present disclosure also provides a performance-optimized decoding method for reconstructing the layered scene decomposition. Consider a display D=(M_(x),M_(y),N_(u),N_(v),f,α,D_(LP)) with a layered scene decomposition L=(K₁,K₂,L^(O),L^(P)) and an associated sampling scheme S=(M_(s),R). The elemental images are decoded by reconstructing the light fields LF^(O) and LF^(P) from deconstructed LF^(O) and LF^(P) light fields downsampled as specified by sampling scheme S. As noted above, particular implementations may restrict viewing to the inner frustum volume or the outer frustum volume of the light field display, thereby requiring the decoding of one of LF^(O) or LF^(P).

LF^(O) can be reconstructed by decoding the elemental images specified by sampling scheme S. The ReconLF method for particular layers does not include inherent constraints regarding the order in which the missing pixels of the missing elemental images are to be reconstructed. It is an object of the present disclosure to reconstruct missing pixels using a method that maximizes throughput; a light field large enough for an effective light field display requires an exceptional amount of data throughput to provide content at an interactive frame rate, therefore improved reconstruction data transmission is required.

FIG. 3B illustrates a general process flow for reconstructing a pixel array. Reconstruction starts (30), followed by implementing a sampling scheme (31). Pixels are then synthesized by column (32) in the array and also synthesized by row (33), which can be done in either order. Once the synthesis of all pixels has been done by column and row, the pixel reconstruction is complete (34).

The present disclosure introduces a basic set of constraints to improve pixel reconstruction with improved data transmission for content at an interactive frame rate. Consider a single light field L_(i)ϵL^(O) containing M_(x)×M_(y) elemental images, as input to ReconLF. The pixels (in other words, the elemental images) are reconstructed in two basic passes. Each pass operates in a separate dimension of the array of elemental images; the system executes the first pass as a column decoding, and the second pass as a row decoding, to reconstruct each of the pixels. While the present disclosure describes a system employing column decoding followed by row decoding, this is not meant to limit the scope and spirit of the invention, as a system employing row decoding followed by column decoding can also be utilized.

In the first pass, the elemental images specified by sampling scheme S are used as reference pixels to fill in missing pixels. FIG. 4 illustrates the elemental images in the matrix as B, or blue pixels (60). The missing pixels (61) are synthesized strictly from reference pixels in the same column. FIG. 5 illustrates schematically a column-wise reconstruction of a pixel matrix, as part of the image (pixel) reconstruction process, showing column-wise reconstruction (63) of red pixels (62) and blue pixels (60). These newly synthesized column-wise pixels are shown as R, or red pixels (62), in FIG. 5 next to blue pixels (60) and missing pixels (61). Newly reconstructed pixels are written to a buffer and act as further pixel references for the second pass, which reconstructs pixels from reference pixels located in the same row as other elemental images. FIG. 6 illustrates a subsequent row-wise reconstruction (64) of the pixel matrix, as part of the image (pixel) reconstruction process, alongside the column-wise reconstruction (63). These newly synthesized row-wise pixels are shown as G, or green pixels (65), next to blue pixels (60) and red pixels (62).

In one embodiment, a process for reconstructing a pixel array is represented by the following pseudocode algorithm:

Dimensional Decomposition Light Field Reconstruction:
    Pass 1:
    for each row of elemental images in L_(i)
        for each missing elemental image in the row
            for each row in elemental image
                load (cache) pixels from same row in reference images
                for each pixel in missing row
                    reconstruct pixel from reference information and write
    Pass 2:
    for each column of elemental images in L_(i)
        for each missing elemental image in the column
            for each column in elemental image
                load (cache) reference pixels from same column
                for each pixel in missing column
                    reconstruct pixel from reference information and write

This performance-optimized decoding method allows the row-decoding and column-decoding constraints to limit the effective working data set required for reconstruction operations.

To reconstruct a single row of a missing elemental image, the system only requires the corresponding row of pixels from the reference elemental images. Likewise, to reconstruct a single column of a missing elemental image, the system only requires the corresponding column of pixels from the reference elemental images. This method requires a smaller dataset than decoding methods previously known in the art, which require entire elemental images for decoding.

Even when decoding relatively large elemental image sizes, the reduced dataset can be stored in a buffer while rows and columns of missing elemental images are being reconstructed, thereby providing improved data transmission.
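By way of non-limiting illustration, the two-pass reconstruction described above may be sketched in Python as follows. The sketch follows the pass order of the pseudocode (rows of elemental images first, then columns). The names `fill_line` and `reconstruct_layer` are hypothetical, and linear interpolation between the nearest reference images is used purely as a stand-in for the warping-based view synthesis of the disclosure. The point of the sketch is the working set: each step caches only one pixel row (or one pixel column) of the reference elemental images.

```python
def fill_line(line):
    """Fill None entries by linear interpolation between the nearest known
    neighbours (an assumed stand-in for warping-based view synthesis)."""
    known = [i for i, v in enumerate(line) if v is not None]
    out = list(line)
    for i, v in enumerate(line):
        if v is not None or not known:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = line[right]
        elif right is None:
            out[i] = line[left]
        else:
            t = (i - left) / (right - left)
            out[i] = (1 - t) * line[left] + t * line[right]
    return out


def reconstruct_layer(grid, height, width):
    """grid[gy][gx] is a height x width image (list of rows), or None if missing."""
    rows, cols = len(grid), len(grid[0])
    out = [[None if img is None else [list(r) for r in img] for img in g]
           for g in grid]

    # Pass 1: per row of elemental images, cache only pixel row r of the references.
    for gy in range(rows):
        refs = [gx for gx in range(cols) if out[gy][gx] is not None]
        if not refs:
            continue  # this grid row has no references; it is filled in pass 2
        missing = [gx for gx in range(cols) if out[gy][gx] is None]
        for gx in missing:
            out[gy][gx] = [[0.0] * width for _ in range(height)]
        for r in range(height):
            cache = {gx: out[gy][gx][r] for gx in refs}
            for c in range(width):
                line = fill_line([cache[gx][c] if gx in cache else None
                                  for gx in range(cols)])
                for gx in missing:
                    out[gy][gx][r][c] = line[gx]

    # Pass 2: per column of elemental images, cache only pixel column c.
    for gx in range(cols):
        refs = [gy for gy in range(rows) if out[gy][gx] is not None]
        if not refs:
            continue
        missing = [gy for gy in range(rows) if out[gy][gx] is None]
        for gy in missing:
            out[gy][gx] = [[0.0] * width for _ in range(height)]
        for c in range(width):
            cache = {gy: [out[gy][gx][r][c] for r in range(height)] for gy in refs}
            for r in range(height):
                line = fill_line([cache[gy][r] if gy in cache else None
                                  for gy in range(rows)])
                for gy in missing:
                    out[gy][gx][r][c] = line[gy]
    return out


# Example: a 3x3 grid of 2x2 elemental images with references at the corners.
def _constant(v):
    return [[v, v], [v, v]]

grid = [[_constant(0.0), None, _constant(2.0)],
        [None, None, None],
        [_constant(4.0), None, _constant(6.0)]]
result = reconstruct_layer(grid, 2, 2)
```

In this example the middle image of the top row is filled in the first pass, and the whole middle row of the grid is filled in the second pass from the images produced by the first.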

Once all the rendered data has been decoded, and each of the plurality of inner and outer display volume layers are reconstructed, the layers are merged into a single inner display volume layer and a single outer display volume layer. The layered scene decomposition layers can be partially decompressed in a staged decompression or can be fully decompressed simultaneously. Algorithmically, the layered scene decomposition layers can be decompressed through a front-to-back or back-to-front process. The final double frustum merging process combines the inner and outer display volume layers to create the final light field for the light field display.

Use of Computational Neural Network with Layered Scene Decomposition

Martin discusses deep learning for light fields, including the use of Convolutional Neural Networks (CNNs). Research on deep learning towards other light field problems is also ongoing, as an increasingly popular approach is to train networks end to end, meaning that the network learns all aspects of the problem at hand. For example, in view synthesis, this avoids using computer vision techniques, such as appearance flow, image inpainting, and depth image-based rendering, to model certain parts of the network. Martin presents the framework for a pipeline concept for view synthesis for light field volume rendering. This pipeline can be implemented with the layered scene decomposition method disclosed herein.

Layered scene decomposition achieves multi-dimensional scene decomposition into layers, or subsets, and elemental images, or subsections. Machine learning is emerging as a learning-based method of view synthesis. A layered scene decomposition provides a method of downsampling a light field following its decomposition into layers. Previously, this was considered in the context of opaque surface rendering with Lambertian shaded surfaces. What is desired is a method of downsampling light fields, as before, that can also be applied with higher-order lighting models, including semi-transparent surfaces, for example, direct volume rendering based lighting models. Volume rendering techniques include but are not limited to direct volume rendering (DVR), texture-based volume rendering, volumetric lighting, two-pass volume rendering with shadows, or procedural rendering.

Direct volume rendering is a rendering process which maps from a volume data set (e.g. a voxel-based sampling of a scalar field) to a rendered image without intermediary geometry (no isosurface). Generally, the scalar field defined by the data is considered as a semi-transparent, light-emitting medium. A transfer function specifies how the field is mapped to opacity and color, and a ray-casting procedure then accumulates the local color and opacity along paths from a camera through the volume.
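By way of non-limiting illustration, the accumulation of color and opacity along a single ray may be sketched as front-to-back compositing. The function names, the example transfer function, and the early-termination threshold are assumptions made for this sketch only; they are not taken from the disclosure.

```python
def example_transfer_function(s):
    """Hypothetical transfer function: scalar sample in [0, 1] -> (r, g, b, alpha)."""
    return (s, 0.2, 1.0 - s, 0.5 * s)


def cast_ray(samples, transfer_function, opacity_cutoff=0.99):
    """Composite scalar samples, ordered from the camera outward, front to back."""
    color = [0.0, 0.0, 0.0]
    alpha = 0.0
    for s in samples:
        r, g, b, a = transfer_function(s)
        weight = (1.0 - alpha) * a      # light from this sample that still reaches the eye
        color[0] += weight * r
        color[1] += weight * g
        color[2] += weight * b
        alpha += weight
        if alpha >= opacity_cutoff:     # samples further along the ray are hidden
            break
    return tuple(color), alpha


ray_color, ray_alpha = cast_ray([1.0, 1.0], example_transfer_function)
```

Each sample contributes in proportion to its own opacity and to the transparency remaining in front of it, which is why no intermediary surface geometry is needed.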

Levoy (1988) first presented that direct volume rendering methods generate images of a 3D volumetric data set without explicitly extracting geometric surfaces from the data. Kniss et al. present that though a data set is interpreted as a continuous function in space, for practical purposes it is represented by a uniform 3D array of samples. In graphics memory, volume data is stored as a stack of 2D texture slices or as a single 3D texture object. The term voxel denotes an individual “volume element,” similar to the terms pixel for “picture element” and texel for “texture element.” Each voxel corresponds to a location in data space and has one or more data values associated with it. Values at intermediate locations are obtained by interpolating data at neighboring volume elements. This process is known as reconstruction and plays an important role in volume rendering and processing applications.

The role of an optical model is to describe how light interacts with particles within the volume. More complex models account for light scattering effects by considering illumination (local) and volumetric shadows. Optical parameters are specified by the data values directly, or they are computed by applying one or more transfer functions to the data to classify particular features in the data. Martin implements volume rendering using volume data sets and provides a depth buffer to assign a depth value for each individual pixel location. The depth buffer, or z-buffer, is converted to a pixel disparity, and the depth buffer value, Z_(b), is converted into normalized coordinates in the range [−1, 1], as Z_(c)=2·Z_(b)−1. Then the perspective projection is inverted to give depth in eye space, Z_(e), as:

$Z_{e} = \frac{2 \cdot Z_{n} \cdot Z_{f}}{Z_{n} + Z_{f} - {Z_{c}\left( {Z_{f} - Z_{n}} \right)}}$

Where Z_(n) is the depth of the camera's near plane and Z_(f) is the depth of the far plane in eye space. Wanner et al. present that Z_(n) should be set as high as possible to improve depth buffer accuracy, while Z_(f) has little effect on the accuracy. Given eye depth Z_(e), it can be converted to a disparity value dr in real units by the use of similar triangles as:

${dr} = {\frac{B \cdot f}{Z_{e}} - {\Delta x}}$

Where B is the distance between two neighbouring cameras in the grid, f is the focal length of the camera, and Δx is the distance between two neighbouring cameras' principal points. Using similar triangles, the disparity in real units can be converted to a disparity in pixels as:

${dp} = \frac{drWp}{Wr}$

Where dp and dr denote the disparity in pixels and real-world units respectively, Wp is the image width in pixels, and Wr is the image sensor width in real units. If the image sensor width in real units is unknown, Wr can be computed from the camera field of view θ and focal length f as:

${Wr} = 2 \cdot f \cdot \tan\left( \frac{\theta}{2} \right)$
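By way of non-limiting illustration, the chain of conversions above, from a depth buffer value to a disparity in pixels, may be sketched as follows. The function names are hypothetical; the arithmetic follows the four equations given above, with the field of view expressed in radians.

```python
import math


def eye_depth(z_b, z_n, z_f):
    """Depth buffer value z_b in [0, 1] -> depth in eye space."""
    z_c = 2.0 * z_b - 1.0                       # normalized coordinates in [-1, 1]
    return (2.0 * z_n * z_f) / (z_n + z_f - z_c * (z_f - z_n))


def disparity_real(z_e, baseline, focal_length, delta_x):
    """Eye depth -> disparity in real units (similar triangles)."""
    return baseline * focal_length / z_e - delta_x


def sensor_width(focal_length, fov):
    """Image sensor width in real units from field of view (radians)."""
    return 2.0 * focal_length * math.tan(fov / 2.0)


def disparity_pixels(d_r, width_pixels, width_real):
    """Disparity in real units -> disparity in pixels."""
    return d_r * width_pixels / width_real
```

As a sanity check on the inversion, a depth buffer value of 0 maps to the near plane Z_(n) and a value of 1 maps to the far plane Z_(f).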

View synthesis may also be formulated through warping. While warping is a simple method to synthesize new views, it can produce visual artifacts that can degrade the visual quality of the warped image. The most common of these artifacts are disocclusions, cracks and ghosting.

Disocclusion artifacts or “occlusion holes” occur when a foreground object is warped and the reference views do not contain the data for the background pixels that are now in view. Occlusion holes can be fixed by inpainting the hole with available background information or filling the hole with actual data captured by extra references or residue information.

Warp cracks occur when warping a surface and two pixels that are adjacent in the reference view are warped to the new view and are no longer adjacent but are separated by a small number of pixels. Rounding errors can cause warp cracks because the newly calculated pixel coordinates have to be truncated to integer image coordinates, which can cause an adjacent pixel to round differently. Sampling frequency can cause warp cracks when a surface is warped into an orientation that increases its pixel count, e.g. a plane slanted to a camera and then viewed perpendicularly. The new view will want to display pixels that were beyond the sampling frequency of the reference camera, leading to cracks in the new image.

Ghosting can occur during the backward warping interpolation phase. Here the back-projected pixel neighborhood contains pixels from both a background and a foreground object. Pixels from the foreground can bleed color information into the background, which can cause a “halo” or ghosting effect. These artifacts usually occur around occlusion holes and bleed foreground color into the background.

One of the main problems with forward warping is that the warped image can contain warp cracks which degrade the visual quality. Depth maps generated with forward warping can easily be fixed by merging multiple views warped from different references or by applying a crack filter. Filters, like a median filter, can effectively remove cracks because a depth map is a very low frequency image, containing mostly subtle gradients or edges around objects. These simple filters will not work on color images due to the complexity of the object textures. A method to remove warp cracks in the color image is to use backward warping. In backward warping, the depth image is first forward warped to get a depth map of the new view. After filtering the cracked depth map, the filtered depth is used to warp back to a reference image. To prevent cracks from occurring, the pixel coordinates are not rounded; instead, a pixel neighborhood is selected and the real pixel weighting is used to interpolate the correct color. This backward warped color image is now free of cracks. A side effect of the interpolation phase is the possibility of introducing ghosting artifacts.

Interactive direct volume rendering is required for interactive viewing of time-varying 4D volume data, as progressive rendering may not work well for that particular use case, as presented by Martin. Example use cases for interactive direct volume rendering include but are not limited to the rendering of static voxel-based data without artifacts during rotation, and the rendering of time-varying voxel-based data (e.g. 4D MRI or ultrasound, CFD, wave, meteorological, visual effects (OpenVDB) and other physical simulations).

A proposed solution includes the use of machine learning to learn how to “warp” volumetric scene views, potentially constrained to a particular transfer function, which communicates how to map the density of the different materials to color and to a level of transparency. For a fixed transfer function, a Computational Neural Network could be trained very well with a modest-sized data set, allowing a user to define a decoder that only works for volume data and only works for a particular transfer function. The potential result is a hardware system, or a decoding system, in which selecting the desired transfer function changes the decoder slightly, based on a different training data set, so that it is able to decode the data it has been given.

A proposed method disclosed herein is suited for the rendering of 4D volume data. Using current hardware and hardware techniques, it is very difficult to brute force render light fields of volume data. What is proposed is generating a layered scene decomposition of volume data which is to be rendered and decoded using a decoder. Decoding the data is effectively filling in missing pixels or elemental images. A convolutional neural network may be trained to solve smaller versions of the problem, using the system employing column decoding followed by row decoding; in addition, a system employing row decoding followed by column decoding may also be utilized. Martin teaches that in order to perform fast but accurate image warping using a disparity map, a form of backward warping with bilinear interpolation is to be implemented. The estimated disparity map for the central view is used as an estimate for all views. Pixels in the novel view that should read data from a location that falls outside the border of the reference view are set to read the closest border pixel in the reference view instead. Essentially, this stretches the border of the reference view in the novel view, rather than producing holes. Since warped pixels rarely fall at integer positions, bilinear interpolation is applied to accumulate information from the four nearest pixels in the reference view. This results in fast warping with no holes, and good accuracy. Martin further discloses a method of training a neural network to apply this correction function. An improvement on this is to teach a neural network compatible with layered scene decomposition. This could be further expanded to apply, for each layer within a layered scene decomposition, a convolutional neural network that was specifically trained for that layer and its depth. A neural network would then be set to perform row reconstruction while another neural network would be set to perform column reconstruction.
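By way of non-limiting illustration, backward warping with border clamping and bilinear interpolation, as described above, may be sketched for a single-channel image as follows. The function name `backward_warp` and the parameters `dx` and `dy` (the offset of the novel view from the reference view, in units that scale the disparity) are assumptions of this sketch rather than part of Martin's method as published.

```python
def backward_warp(reference, disparity, dx, dy):
    """Warp `reference` to a novel view offset by (dx, dy) using `disparity`.

    `reference` and `disparity` are equally sized lists of rows of floats.
    """
    h, w = len(reference), len(reference[0])
    novel = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = disparity[y][x]
            # Source location in the reference view, clamped to the border so
            # that out-of-range reads stretch the border instead of leaving holes.
            sx = min(max(x + dx * d, 0.0), w - 1.0)
            sy = min(max(y + dy * d, 0.0), h - 1.0)
            x0, y0 = int(sx), int(sy)           # sx, sy are non-negative here
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            fx, fy = sx - x0, sy - y0
            # Bilinear interpolation over the four nearest reference pixels.
            top = (1.0 - fx) * reference[y0][x0] + fx * reference[y0][x1]
            bottom = (1.0 - fx) * reference[y1][x0] + fx * reference[y1][x1]
            novel[y][x] = (1.0 - fy) * top + fy * bottom
    return novel


warped = backward_warp([[0.0, 10.0, 20.0, 30.0]], [[1.0, 1.0, 1.0, 1.0]], 0.5, 0.0)
```

In the example, the last pixel would read from beyond the right border and is therefore clamped to the border value, while the others blend their two horizontal neighbours equally.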

Based on the selected criteria, a light field display simulator could be used to train the neural network. A light field display simulator provides a high-performance method of exploring the parameterization of simulated virtual three-dimensional light field displays. This method uses a canonical image generation method as part of its computational process to simulate a virtual observer's view(s) of a simulated light field display. The canonical image method provides for a robust, fast and versatile method for generating a simulated light field display and the light field content displayed thereon.

Higher Order Lighting Models

In computer graphics, the color of an opaque dielectric can be modeled with Lambertian reflectance; the color is considered constant with respect to the viewing angle. This correlates with the standard methods of color measurement used in industry which are based on the same Lambertian (or near Lambertian) reflectance.

The fundamental concept of layered scene decomposition is the ability to partition a scene into sets and subsets, and then to reconstruct these partitions to form a light field. This concept is based upon the capability to warp pixels to reconstruct missing elemental images in a layer. In further detail, the light intensity of a particular point in one image is mapped into another image at a slightly different pixel position, based on the geometric shift that results when the camera is shifted from left to right. The assumption underlying the use of the warping method described herein to accurately reconstruct the missing pixels within layers is that the pixels mapped from one image to the next have the same color, as they would with a Lambertian lighting model.

Light fields, especially when restricted under the assumption of a Lambertian lighting model, have significant redundancy. It may be observed that each elemental image differs very slightly from neighboring images. This redundancy is described in the literature under plenoptic sampling theory. The Lambertian lighting model is sufficient for useful graphics, but not overly realistic. To capture gloss, haze, and goniochromatic color of an object, alternate models are investigated, including, but not limited to, the specular exponent of the Phong model, the surface roughness of the Ward model, and the surface roughness of the Cook-Torrance model. Gloss may be defined as a measure of the magnitude of the specular reflection, and haze may be defined as the parameter which captures the width of the specular lobe.

In order to utilize alternate lighting models with the view synthesis aspect of the layered scene decomposition methods disclosed herein, it is proposed that shading can be applied as a post-process to reconstructed pixels. This post-process occurs after the warping process (or other view synthesis reconstruction) has occurred. The surface normal information relative to a light position, or point, may be known and included with the encoded light field data. The encoded data may include a list of light positions, allowing the decoder to use the normal data when it is decoding a particular pixel in a layer to compute a specular component. Other parameters could be included with the light field data, such as whether a surface has a specular component or not, or the degree to which it does, quantified as a number. Material properties could also potentially be included along with intensity values. This additional data could be sent in combination with the typical RGB and depth data that is sent in the encoded form within each elemental image, or each layer. Material properties may include, but are not limited to, atomic, chemical, mechanical, thermal and optical properties.
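By way of non-limiting illustration, the specular post-process may be sketched with the Phong model as follows. The reconstructed (warped) pixel color is treated as view-independent, and a specular term computed from the transmitted surface normal is added afterwards. The function names, the specular coefficient `k_s`, and the exponent `shininess` are assumptions of this sketch; the disclosure does not prescribe a particular lighting model or parameter encoding.

```python
import math


def _normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong_specular(normal, to_light, to_viewer, k_s, shininess):
    """Phong specular intensity at a surface point (vectors point away from it)."""
    n, l, v = _normalize(normal), _normalize(to_light), _normalize(to_viewer)
    n_dot_l = _dot(n, l)
    if n_dot_l <= 0.0:
        return 0.0                               # light is behind the surface
    reflected = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    return k_s * max(_dot(reflected, v), 0.0) ** shininess


def shade_reconstructed_pixel(rgb, normal, to_light, to_viewer, k_s=0.5, shininess=32):
    """Add a view-dependent highlight to a pixel reconstructed by view synthesis."""
    s = phong_specular(normal, to_light, to_viewer, k_s, shininess)
    return tuple(min(1.0, c + s) for c in rgb)


head_on = shade_reconstructed_pixel((0.2, 0.2, 0.2), (0, 0, 1), (0, 0, 1), (0, 0, 1))
grazing = shade_reconstructed_pixel((0.2, 0.2, 0.2), (0, 0, 1), (0, 0, 1), (1, 0, 0))
```

Because the highlight depends on the viewing direction, it is computed per reconstructed view rather than warped from a reference view, which is what permits the Lambertian assumption to be retained for the warping step itself.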

The concept of storing surface normals in combination with RGB and depth information is known as a G-Buffer in computer graphics.

FIG. 14 illustrates an exemplary layered scene decomposition CODEC method, whereby 3D scene data in the format of image description or light field data is loaded to an encoder (400) for encoding, whereupon data (sub)sets as illustrated in the figure, or alternatively the entire data set representing the 3D scene, is partitioned (403). In the case of the identification of 3D scene data subsets for partitioning (402), it is understood that the identification process is a general process step reference which is intended to simply refer to the ability to partition the data set in one pass, or in groupings (e.g. to encode inner frustum and outer frustum data layers as illustrated in more detail in FIG. 11), as may be desired according to the circumstances. In this regard, the identification of data subsets may imply pre-encoding processing steps or processing steps also forming part of the encoding sub-process stage (401). Data subsets may be tagged, specified, confirmed, scanned and even compiled or grouped at the time of partitioning to produce a set of layers (decomposition of the 3D scene) (403). Following the partitioning of data subsets (403), each data layer is sampled and rendered according to the present disclosure to produce compressed (image) data (404). Following data layer compression, the compressed data is transmitted to a decoder (405) for the decoding sub-process comprising decompression, decoding and re-composition steps to (re)construct a set of light fields (407), otherwise referred to herein as “layered light fields”, layered light field images and light field layers. Specular lighting is calculated (411) and the constructed layered light fields are merged to produce the final light field (408) displaying the 3D scene (409).

FIG. 15 illustrates a computer-implemented method comprising:

receiving a first data set comprising a three-dimensional description of a scene (420);

the first data set comprises information on directions of normals on surfaces included in the scene (421);

the directions of the normals are represented with respect to a reference direction (422); and

optionally the reflection properties of at least some of the surfaces are non-Lambertian;

partitioning the first data set into a plurality of layers each representing a portion of the scene at a location with respect to a reference location (423); and

encoding multiple layers to generate a second data set, wherein a size of the second data set is smaller than a size of the first data set (424).

In an embodiment, the method further comprises:

receiving the second data set (425);

reconstructing the portions associated with a layer using the directions of normals on surfaces included in the scene for calculation of a specular component (426);

combining the reconstructed portions into a light field (427); and

presenting the light field image on a display device (428).

To gain a better understanding of the invention described herein, the following examples are set forth with reference to the Figures. It will be understood that these examples are intended to describe illustrative embodiments of the invention and are not intended to limit the scope of the invention in any way.

EXAMPLES

Example 1: Exemplary Encoder and Encoding Method for a Light Field Display

The following illustrative embodiment of the invention is not intended to limit the scope of the invention as described and claimed herein, as the invention can successfully implement a plurality of system parameters. As described above, a conventional display as previously known in the art consists of spatial pixels substantially evenly-spaced and organized in a two-dimensional row, allowing for an idealized uniform sampling. By contrast, a three-dimensional (3D) display requires both spatial and angular samples. While the spatial sampling of a typical three-dimensional display remains uniform, the angular samples cannot necessarily be considered uniform in terms of the display's footprint in angular space.

In the illustrative embodiment, a plurality of light field planar-parameterized pinhole projectors provide angular samples, also known as directional components of the light field. The light field display is designed for a 640×480 spatial resolution and a 512×512 angular resolution. The plurality of planar-parameterized pinhole projectors are idealized with identity function α. The pitch between each of the plurality of planar-parameterized pinhole projectors is 1 mm, thereby defining a 640 mm × 480 mm display surface. The display has a 120° FOV, corresponding to an approximate focal length f=289 μm.

This light field display contains 640×480×512×512=80.5 billion RGB pixels. Each RGB pixel requires 8 bits per color channel, therefore one frame of the light field display requires 80.5 billion×8×3=1.93 Tb. For a light field display providing interactive content, data is driven at 30 frames/s, requiring a bandwidth of 1.93 Tb×30 frames/s=58.0 Tb/s. Current displays known in the art are driven by DisplayPort technology providing maximum bandwidths of 32.4 Gb/s, therefore such displays would require over 1024 DisplayPort cables to provide the tremendous bandwidth required by interactive light field displays, resulting in cost and form-factor design constraints.
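By way of non-limiting illustration, the raw bandwidth arithmetic above may be reproduced as follows. The variable names are chosen for this sketch; the figures are those stated in the illustrative embodiment.

```python
SPATIAL_RESOLUTION = (640, 480)
ANGULAR_RESOLUTION = (512, 512)
BITS_PER_PIXEL = 8 * 3                 # 8 bits for each of R, G and B
FRAMES_PER_SECOND = 30
DISPLAYPORT_BITS_PER_SECOND = 32_400_000_000   # 32.4 Gb/s per link

pixels = (SPATIAL_RESOLUTION[0] * SPATIAL_RESOLUTION[1]
          * ANGULAR_RESOLUTION[0] * ANGULAR_RESOLUTION[1])
bits_per_frame = pixels * BITS_PER_PIXEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

# Ceiling division: number of links needed to carry the uncompressed stream.
links_needed = -(-bits_per_second // DISPLAYPORT_BITS_PER_SECOND)

print(f"{pixels / 1e9:.1f} billion pixels, "
      f"{bits_per_frame / 1e12:.2f} Tb per frame, "
      f"{bits_per_second / 1e12:.1f} Tb/s")
```

The uncompressed stream would occupy well over one thousand such links, consistent with the "over 1024" figure stated above.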

The illustrative embodiment delivers data to a light field display from a computer equipped with an accelerated GPU with dual DisplayPort 1.3 cables output. We consider a conservative maximum throughput of 40 Gb/s. The encoded frames must be small enough for transmission over the DisplayPort connection to a decoding unit physically located closer to the light field display.

The layered scene decomposition of the illustrative embodiment is designed to allow the required data throughput. With the dimensions defined above, the maximum depth of field of the light field display is Z_(DOF)=(289 microns)(512)=147968 microns=147.968 mm. The layered scene decomposition places a plurality of layered scene decomposition layers within the depth of field region of the light field display, ensuring that the distance of the layered scene decomposition layers from the display surface is less than Z_(DOF). This illustrative example describes a light field display with objects located only within the inner frustum volumes of the display. This illustrative example is not intended to limit the scope of the invention, as the invention can successfully implement a plurality of system parameters, such as a light field display with objects located only within the outer frustum volume of the display, or a light field display with objects located within both the inner and outer frustum volumes of the display; embodiments limited to one frustum volume require a smaller number of layered scene decomposition layers, thereby marginally decreasing the size of the encoded light field to be produced.
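By way of non-limiting illustration, the focal length and maximum depth of field of the illustrative embodiment may be checked numerically. The pinhole relation used here, f = (pitch / 2) / tan(FOV / 2), is an assumption of this sketch; it is consistent with the stated 1 mm pitch, 120° FOV and f of approximately 289 μm. The depth of field is taken, as in the text, as the focal length multiplied by the angular resolution.

```python
import math

PITCH_UM = 1000.0          # 1 mm projector pitch, in microns
FOV_DEGREES = 120.0
ANGULAR_RESOLUTION = 512   # directional samples along one axis

# Assumed pinhole geometry: half the pitch subtends half the field of view.
focal_length_um = (PITCH_UM / 2.0) / math.tan(math.radians(FOV_DEGREES / 2.0))
focal_length_rounded_um = round(focal_length_um)

# Maximum depth of field, using the rounded focal length as in the text.
z_dof_um = focal_length_rounded_um * ANGULAR_RESOLUTION
z_dof_mm = z_dof_um / 1000.0
```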

The illustrative embodiment defines ten layered scene decomposition layers. When necessary, additional layered scene decomposition layers can be added to capture data that could be lost to occlusions, or to increase the overall compression rate. However, additional layered scene decomposition layers require additional computation from the decoder, thus the number of layered scene decomposition layers is carefully chosen. The illustrative embodiment specifies the ten layered scene decomposition layers from their front and back boundaries and assumes that the dividing boundaries of the layer are parallel to the display surface.

Each layered scene decomposition layer is located at a defined distance from the display surface, where the distances are specified in terms of multiples of focal length f, up to the maximum depth of field of 512f. Layered scene decomposition layers with a narrower width are concentrated closer to the display surface, and the layer width (i.e., the depth difference between the front and back layer boundaries) increases exponentially by powers of 2 as the distance from the display surface increases. This embodiment of the invention is not intended to limit the scope of the invention, as other layer configurations can be implemented successfully.

The following table (Table 1) describes the layered scene decomposition layer configurations of the illustrative embodiment, and provides a sampling scheme based on plenoptic sampling theory to create sub-sampled layered scene decomposition layers:

TABLE 1

| Layer | Front boundary | Back boundary | Elemental image resolution | Maximum distance between sampled elemental images (sampling gap) | Elemental images sampled | Total data size required (24 bit color, 8 bits for depth/disparity) |
|---|---|---|---|---|---|---|
| 0 | 1f | 1f | 1 × 1 | 0 | 640 × 480 | 7.37 Mbits |
| 1 | 1f | 2f | 2 × 2 | 1 | 321 × 241 | 9.90 Mbits |
| 2 | 2f | 4f | 4 × 4 | 2 | 214 × 161 | 17.64 Mbits |
| 3 | 4f | 8f | 8 × 8 | 4 | 161 × 97 | 31.98 Mbits |
| 4 | 8f | 16f | 16 × 16 | 8 | 72 × 55 | 32.44 Mbits |
| 5 | 16f | 32f | 32 × 32 | 16 | 41 × 31 | 41.65 Mbits |
| 6 | 32f | 64f | 64 × 64 | 32 | 21 × 16 | 44.04 Mbits |
| 7 | 64f | 128f | 128 × 128 | 64 | 11 × 9 | 51.90 Mbits |
| 8 | 128f | 256f | 256 × 256 | 128 | 6 × 5 | 62.91 Mbits |
| 9 | 256f | 512f | 512 × 512 | 256 | 4 × 3 | 100.66 Mbits |
| Total: | | | | | | 400.49 Mbits |

In the above table, layer 0 captures images that are to be displayed at the display surface, as in a conventional two-dimensional display known in the art. Layer 0 contains 640×480 pixels at a fixed depth, so it does not require any depth information. The total data size is calculated for each pixel with an RGB value of 8 bits per color channel and a depth value of 8 bits (alternate embodiments may require larger bit values, such as 16 bits). In the illustrative embodiment, the elemental image resolution and sampling gap are calculated from the formulas described above, and the sampling scheme chosen reflects the elemental image resolution and sampling gap restrictions.

As described in the above table, the combined layered scene decomposition system has a total size of 400.5 Mb. Therefore, to produce data at a rate of 30 frames/s, a bandwidth of 30×0.4005=12.01 Gb/s is required. This encoded form is sent over the dual DisplayPort 1.3 cables, along with additional information required to represent scene occlusions.
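By way of non-limiting illustration, the per-layer data sizes of Table 1 and the resulting bandwidth may be recomputed as follows. The tuple layout and names are chosen for this sketch. Each sampled elemental image contributes its resolution squared in pixels, at 24 bits of color plus 8 bits of depth/disparity, except layer 0, which carries no depth.

```python
# (layer, elemental image resolution n for an n x n image,
#  sampled elemental images across, sampled elemental images down)
LAYERS = [
    (0, 1, 640, 480), (1, 2, 321, 241), (2, 4, 214, 161), (3, 8, 161, 97),
    (4, 16, 72, 55), (5, 32, 41, 31), (6, 64, 21, 16), (7, 128, 11, 9),
    (8, 256, 6, 5), (9, 512, 4, 3),
]
FRAMES_PER_SECOND = 30


def layer_bits(layer, n, across, down):
    """Encoded size of one layer in bits."""
    bits_per_pixel = 24 if layer == 0 else 24 + 8   # layer 0 needs no depth
    return n * n * across * down * bits_per_pixel


sizes = {layer: layer_bits(layer, n, a, d) for layer, n, a, d in LAYERS}
total_bits = sum(sizes.values())
bandwidth_bits_per_second = total_bits * FRAMES_PER_SECOND

for layer, bits in sizes.items():
    print(f"layer {layer}: {bits / 1e6:.2f} Mbits")
print(f"total: {total_bits / 1e6:.1f} Mbits, "
      f"{bandwidth_bits_per_second / 1e9:.2f} Gb/s at {FRAMES_PER_SECOND} frames/s")
```

The total, roughly 12 gigabits per second, fits within the conservative 40 Gb/s throughput assumed for the dual DisplayPort 1.3 connection.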

In the illustrative embodiment, the layered scene decomposition layers are configured by an encoder, efficiently implementing an oblique rendering technique to produce the layers located closer to the display surface (layers 0 to 5) and a perspective rendering technique to produce the layers located further away from the display surface (layers 6 to 9). Each elemental image corresponds to a single rendering view.

At layer 6, the number of separate angles to be rendered (64×64=4096) exceeds the number of views to be rendered (21×16=336); this signals the transition in efficiency between the oblique and perspective rendering methods. It should be noted that specific implementation aspects may provide additional overhead that skews the exact optimal transition point. For use with modern graphics acceleration techniques known in the art, perspective rendering can be efficiently implemented using geometry shader instancing. Multiple views are rendered from the same set of input scene geometry without repeatedly accessing the geometry through draw calls and without repeatedly accessing memory to retrieve the exact same data.
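By way of non-limiting illustration, the transition between the two rendering techniques may be located by comparing, for each layer, the number of distinct angles (one oblique render per angle) against the number of sampled elemental images (one perspective render per view). This simple cost model is an assumption of the sketch and, as noted above, implementation overhead may shift the optimal transition point.

```python
# (layer, elemental image resolution n for an n x n image,
#  sampled elemental images across, sampled elemental images down)
LAYERS = [
    (0, 1, 640, 480), (1, 2, 321, 241), (2, 4, 214, 161), (3, 8, 161, 97),
    (4, 16, 72, 55), (5, 32, 41, 31), (6, 64, 21, 16), (7, 128, 11, 9),
    (8, 256, 6, 5), (9, 512, 4, 3),
]


def rendering_method(n, across, down):
    """Pick the technique that needs fewer separate renders for the layer."""
    oblique_renders = n * n              # one render per angle
    perspective_renders = across * down  # one render per elemental image
    return "perspective" if oblique_renders > perspective_renders else "oblique"


methods = {layer: rendering_method(n, a, d) for layer, n, a, d in LAYERS}
```

With the sampling scheme of Table 1, layer 5 still favors oblique rendering (1024 angles against 1271 views), and layer 6 is the first to favor perspective rendering.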

FIG. 8 illustrates the illustrative embodiment, with ten layered scene decomposition layers (100-109) in the inner frustum volume (110). The inner frustum volume layers extend from the display surface (300). The layers are defined as described in the table above; for example, the front boundary of inner frustum volume layer 0 (100) is 1f, of inner frustum volume layer 1 (101) is 1f, of inner frustum volume layer 2 (102) is 2f, of inner frustum volume layer 3 (103) is 4f, and so on. Inner frustum volume layers 0 to 5 (100-105), the layers closest to the display surface (300), are rendered with the oblique rendering technique, and inner frustum volume layers 6 to 9 (106-109), the layers furthest from the display surface, are rendered with the perspective rendering technique.

FIG. 9 illustrates an alternate embodiment, with ten layered scene decomposition layers (100-109) in the inner frustum volume (110) and ten layered scene decomposition layers (200-209) in the outer frustum volume (210). The inner and outer frustum volume layers extend from the display surface (300). While the inner and outer frustum volume layers are illustrated as mirror images from each other, the inner and outer frustum volume may have differing numbers of layers, layers of different sizes, or layers of different depths. Inner frustum volume layers 0 to 5 (100-105) and outer frustum volume layers 0 to 5 (200-205) are rendered with the oblique rendering technique, and inner frustum volume layers 6 to 9 (106-109) and outer frustum volume layers 6 to 9 (206-209) farther from the display surface (300) are rendered with the perspective rendering technique.

An alternate embodiment can implement the system with a ray-tracing encoding based approach. Rendering a complete layered scene decomposition layer representation can require increased GPU performance, even with the optimizations described herein, as GPUs are optimized for interactive graphics on conventional two-dimensional displays where accelerated rendering of single views is desirable. The computational cost of the ray-tracing approach is a direct function of the number of pixels the system is to render. While the layered scene decomposition layer system contains a comparable number of pixels to some two-dimensional single view systems, the form and arrangement of said pixels differs greatly due to layer decomposition and corresponding sampling schemes. Therefore, there may be implementations where tracing some or all of the rays is a more efficient implementation.

Example 2: CODEC Decoder and Decoding Method for a Light Field Display

In the illustrative embodiment of the invention, the decoder receives the 12.01 Gb/s of encoded core representation data, plus any residue representation data, from the GPU over dual DisplayPort 1.3 cables. The compressed core representation data is decoded using a customized FPGA, ASIC, or other integrated circuit to implement efficient decoding (residue representation data is decoded separately, as illustrated in FIG. 13). The 12.01 Gb/s core representation is decompressed to 58 Tb/s for the final light field display. Note that this core representation does not include the residue representations necessary to render occlusions. The ratio

$\frac{58\ \mathrm{Tb/s}}{12.01\ \mathrm{Gb/s}}$

provides a compression ratio of approximately 4833:1; while this is a high-performance compression ratio, the reconstructed light field data may still exhibit occlusion-based artifacts unless residue representation data is included in the reconstruction.

For the illustrative embodiment shown in FIG. 8, data is decoded by reconstructing individual layered scene decomposition layers and merging the reconstructed layers into an inner frustum volume layer. For an alternate embodiment, such as illustrated in FIG. 9, the data is decoded by reconstructing individual layered scene decomposition layers and merging the reconstructed layers into an inner frustum volume layer and an outer frustum volume layer.

A single layered scene decomposition layer can be reconstructed from the data sampled under a given sampling scheme using view synthesis techniques from the field of Image-Based Rendering which are known in the art. For example, Graziosi et al. specify using reference elemental images to reconstruct the light field in a single pass. This method uses reference elemental images offset from the reconstructed image in multiple dimensions. Because the elemental image data represents three-dimensional scene points (including RGB color and disparity), pixels are decoded as a nonlinear function (although fixed on the directional vector between the reference and target elemental images), therefore requiring a storage buffer of equal size to the decoding reference elemental images. When decoding larger elemental images, this can create memory storage or bandwidth constraints, depending on the decoding hardware.

For a light field display with an elemental image size of 512×512 pixels with 24-bit color, a decoder requires a buffer capable of storing 512×512=262,144 24-bit values (without disparity bits in this example). Current high-performance FPGA devices provide internal block memory (BRAM) organized as 18/20-bit wide memory with 1024 memory locations, which can be used as a 36/40-bit wide memory with 512 memory locations. A buffer capable of reading and writing an image in the same clock cycle must be large enough to hold two reference elemental images, as the nonlinear decoding process causes the write port to use a non-deterministic access pattern. Implementing this buffer in an FPGA device for a 512×512 pixel image requires 1024 BRAM blocks. Depending on the reconstruction algorithm used, multiple buffers may be required in each decoder pipeline. To meet the data rate of a high-density light field display, the system may require more than one hundred parallel pipelines, which is orders of magnitude more than current FPGA devices can support. Because each buffer requires an independent read/write port, it may not be possible to implement such a system on current ASIC devices.

The present disclosure circumvents buffer and memory limitations by dividing the pixel reconstruction process into multiple, single-dimension stages. The present disclosure implements one-dimensional reconstruction to fix the directional vector between the reference elemental images and the target to a rectified path. While reconstruction remains nonlinear, the reference pixel to be translated to the target location is locked to the same row or column location of the target pixel. Therefore, decoder buffers only need to capture one row or one column at a time. For the elemental image of 512×512 pixels with 24-bit color described above, the decoder buffer is organized as a 24-bit wide, 1024 deep memory requiring two 36/40×512 BRAM blocks. Therefore, the present disclosure reduces the memory footprint by a factor of 512, or more than two orders of magnitude. This allows a display pixel fill rate requiring over a hundred decoding pipelines to be supported by current FPGA devices.
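
The buffer figures quoted for the 512×512 example can be checked with a few lines of arithmetic. The sketch below assumes one 24-bit pixel per BRAM location and reads the 1024-deep line buffer as two lines of 512 pixels; both readings, and the helper name `bram_blocks`, are illustrative assumptions rather than details stated in the text.

```python
# Illustrative arithmetic only; the numeric figures come from the surrounding text.
BRAM_DEPTH = 512   # locations per BRAM when configured 36/40 bits wide
IMAGE_SIDE = 512   # elemental image is 512 x 512 pixels


def bram_blocks(entries, depth=BRAM_DEPTH):
    """Whole BRAM blocks needed to hold `entries` pixels (ceiling division).

    ASSUMPTION: one 24-bit pixel occupies one 36/40-bit BRAM location.
    """
    return -(-entries // depth)


# Full-image buffer: room for two reference elemental images.
full_image_blocks = bram_blocks(2 * IMAGE_SIDE * IMAGE_SIDE)

# Line buffer: 24-bit wide, 1024 deep (assumed here to be two lines of 512 pixels).
line_buffer_blocks = bram_blocks(2 * IMAGE_SIDE)

reduction_factor = full_image_blocks // line_buffer_blocks

print(full_image_blocks, line_buffer_blocks, reduction_factor)  # 1024 2 512
```

Under these assumptions the full-image buffer needs 1024 blocks, the line buffer needs 2, and the ratio between them is the factor of 512 cited above.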

Multi-stage decoding architectures require two stages to reconstruct the two-dimensional pixel array in a light field display. The two stages are orthogonal to one another and reconstruct rows or columns of elemental images. The first decoding stage may require a pixel scheduler to ensure that output pixels are ordered to be compatible with the next stage's input pixels. Due to the extremely high bandwidth required by each decoding stage, some output pixels from a previous stage may need to be reused to reduce local storage (cache) requirements. In this case, an external buffer can be used to capture all of the output pixels from a first stage so the subsequent decoding stage can efficiently access pixel data, reducing logic resources and memory bandwidth.
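
A minimal software sketch of such a two-stage reconstruction is given below, under stated assumptions. It forward-warps each row by a disparity-scaled horizontal offset, reorders the result so the second stage reads whole columns, and then warps each column by the vertical offset. The shift rule (`disparity * offset`, rounded), the larger-disparity-wins conflict rule, and the names `warp_line` and `reconstruct_two_stage` are hypothetical choices for illustration, not the disclosed hardware design.

```python
def warp_line(line, offset):
    """Forward-warp one row or column of (color, disparity) samples along its own axis.

    Only this single line needs to be buffered.
    ASSUMPTIONS: a sample moves by round(disparity * offset) positions, and when
    two samples land on the same target, the larger disparity (taken as nearer) wins.
    Target positions that receive no sample stay None (holes).
    """
    out = [None] * len(line)
    for i, sample in enumerate(line):
        if sample is None:
            continue
        _, disparity = sample
        j = i + round(disparity * offset)
        if 0 <= j < len(line) and (out[j] is None or out[j][1] < disparity):
            out[j] = sample
    return out


def reconstruct_two_stage(reference, dx, dy):
    """Stage 1 warps every row by dx; stage 2 warps every column by dy."""
    stage1 = [warp_line(row, dx) for row in reference]
    # Reorder stage-1 output into columns, standing in for the pixel
    # scheduler or external buffer that feeds the second stage.
    columns = [list(col) for col in zip(*stage1)]
    stage2 = [warp_line(col, dy) for col in columns]
    # Transpose back to row-major order.
    return [list(row) for row in zip(*stage2)]


# 4 x 4 reference: zero-disparity background with one foreground pixel.
reference = [[("bg", 0) for _ in range(4)] for _ in range(4)]
reference[1][1] = ("fg", 1)
result = reconstruct_two_stage(reference, dx=1, dy=2)
```

In this example the foreground pixel moves one column right in the first stage and two rows down in the second, ending at row 3, column 2. The positions it vacated in each stage are left as holes, which a fuller decoder would need to fill from other reference data.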

The present disclosure's multi-stage decoding with an external memory buffer allows the decoding process to transfer the required memory bandwidth from expensive on-die memory to lower cost memory devices such as double data rate (DDR) memory devices. A high-performance decoding pixel scheduler ensures maximum reference pixel reuse from this external memory buffer, allowing the system to use narrower or slower memory interfaces.

The disclosures of all patents, patent applications, publications and database entries referenced in this specification are hereby specifically incorporated by reference in their entirety to the same extent as if each such individual patent, patent application, publication and database entry were specifically and individually indicated to be incorporated by reference.

Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention. All such modifications as would be apparent to one skilled in the art are intended to be included within the scope of the following claims.

REFERENCE LIST

-   ALPASLAN, ZAHIR Y., EL-GHOROURY, HUSSEIN S., CAI, JINGBO. “Parametric Characterization of Perceived Light Field Display Resolution”. pages 1241-1245, 2016.
-   BALOGH, TIBOR, KOVACS. The Holovizio system-New opportunity offered by 3D displays. Proceedings of the TMCE, (May):1-11, 2008.
-   BANKS, MARTIN S., DAVID M. HOFFMAN, JOOHWAN KIM AND GORDON WETZSTEIN. “3D Displays” Annual Review of Vision Science. 2016. pp. 397-435.
-   CHAI, JIN-XIANG, XIN TONG, SHING-CHOW CHAN, AND HEUNG-YEUNG SHUM. “Plenoptic Sampling”
-   CHEN, A., WU M., ZHANG Y., LI N., LU J., GAO S., and YU J. 2018. “Deep Surface Light Fields”. Proc. ACM Comput. Graph. Interact. Tech. 1, 1, Article 14 (July 2018), 17 pages. DOI:https://doi.org/10.1145/3203192
-   CLARK, JAMES J., MATTHEW R. PALMER AND PETER D. LAWRENCE. “A Transformation Method for the Reconstruction of Functions from Nonuniformly Spaced Samples” IEEE Transactions on Acoustics, Speech, and Signal Processing. October 1985. pp 1151-1165. Vol. ASSP-33, No. 4.
-   DO, MINH N., DAVY MARCHAND-MAILLET AND MARTIN VETTERLI. “On the Bandwidth of the Plenoptic Function” IEEE Transactions on Image Processing. pp. 1-9.
-   DODGSON, N. A. Analysis of the viewing zone of the Cambridge autostereoscopic display. Applied optics, 35(10):1705-10, 1996.
-   DODGSON, N. A. Analysis of the viewing zone of multiview autostereoscopic displays. Electronic Imaging 2002. International Society for Optics and Photonics, pp. 254-265, 2002.
-   GORTLER, STEVEN J., RADEK GRZESZCZUK, RICHARD SZELISKI, AND MICHAEL F COHEN. “The Lumigraph” 43-52.
-   GRAZIOSI, D. B., ALPASLAN, Z. Y., EL-GHOROURY, H. S., Compression for Full-Parallax Light Field Displays. Proc. SPIE 9011, Stereoscopic Displays and Applications XXV, (MARCH), 90111A. 2014.
-   GRAZIOSI, D. B., ALPASLAN, Z. Y., EL-GHOROURY, H. S., Depth Assisted Compression of Full Parallax Light Fields. Proc. SPIE 9391, Stereoscopic Displays and Applications XXVI, (FEBRUARY), 93910Y. 2015.
-   HALLE, MICHAEL W. AND ADAM B. KROPP. “Fast Computer Graphics Rendering for Full Parallax Spatial Displays”. Proc. SPIE 3011, Practical Holography XI and Holographic Materials III, (10 Apr. 1997).
-   HALLE, MICHAEL W. Multiple Viewpoint Rendering. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques (SIGGRAPH '98). Association for Computing Machinery, New York, N.Y., USA, 243-254.
-   JANTET, VINCENT. “Layered Depth Images for Multi-View Coding” Multimedia. pp. 1-135. Universite Rennes 1, 2012. English.
-   LANMAN, D., WETZSTEIN, G., HIRSCH, M., AND RASKAR, R., Depth of Field Analysis for Multilayer Automultiscopic Displays. Journal of Physics: Conference Series, 415(1):012036, 2013.
-   LEVOY, MARC, AND PAT HANRAHAN. “Light Field Rendering” SIGGRAPH. pp. 1-12.
-   MAARS, A., WATSON, B., HEALEY, C. G., Real-Time View Independent Rasterization for Multi-View Rendering. Eurographic Proceedings, The Eurographics Association. 2017.
-   MARSCHNER, STEPHEN R. AND RICHARD J. LOBB. “An Evaluation of Reconstruction Filters for Volume Rendering” IEEE Visualization Conference 1994.
-   MASIA, B., WETZSTEIN, G., ALIAGA, C., RASKAR, R., GUTIERREZ, D., Display adaptive 3D content remapping. Computers and Graphics (Pergamon), 37(8):983-996, 2013.
-   MATSUBARA, R., ALPASLAN, ZAHIR Y., EL-GHOROURY, HUSSEIN S., Light Field Display Simulation for Light Field Quality Assessment. Stereoscopic Displays and Applications XXVI, 7(9391):93910G, 2015.
-   PIAO, YAN, AND XIAOYUAN YAN. “Sub-sampling Elemental Images for Integral Imaging Compression” IEEE. pp. 1164-1168. 2010.
-   VETRO, ANTHONY, THOMAS WIEGAND, AND GARY J. SULLIVAN. “Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard.” Proceedings of the IEEE. pp. 626-642. April 2011. Vol. 99, No. 4.
-   WETZSTEIN, G., HIRSCH, M., Tensor Displays: Compressive Light Field Synthesis using Multilayer Displays with Directional Backlighting. 1920.
-   WIDMER, S., D. PAJAK, A. SCHULZ, K. PULLI, J. KAUTZ, M. GOESELE, AND D. LUEBKE. An Adaptive Acceleration Structure for Screen-space Ray Tracing. Proceedings of the 7th Conference on High-Performance Graphics, HPG′15, 2015.
-   ZWICKER, M., W. MATUSIK, F. DURAND, H. PFISTER. “Antialiasing for Automultiscopic 3D Displays” Eurographics Symposium on Rendering. 2006.

1. A computer-implemented method comprising: receiving a first data set comprising a three-dimensional description of a scene, the first data set comprising information on directions of normals on surfaces in the scene with respect to a reference direction; partitioning the first data set into a plurality of layers, each layer representing a portion of the scene at a location with respect to a reference location; encoding the plurality of layers to generate a second data set, wherein the second data set is smaller in size than the first data set; calculating specular lighting by reconstructing the plurality of layers using the directions of normals on surfaces included in the scene to provide constructed layered light fields; merging the constructed layered light fields to produce a final light field image; and displaying the three-dimensional scene.
 2. The method of claim 1, wherein calculating the specular lighting by reconstructing the plurality of layers is done using a multi-stage view synthesis reconstruction.
 3. The method of claim 2, wherein the view synthesis reconstruction is done using a warping process, screen space ray tracing, machine learning, image-based rendering, or a combination thereof.
 4. The method of claim 1, wherein at least some of the surfaces in the scene have non-Lambertian reflection properties.
 5. The method of claim 1, wherein the constructed layered light fields represent inner frustum and outer frustum volumes of the final light field image.
 6. The method of claim 1, wherein the surface normal information relative to a light position, or point, may be known and included with encoded light field data.
 7. The method of claim 1, further comprising storing the normals on surfaces in combination with RGB and depth information.
 8. The method of claim 1, wherein the method captures at least one of gloss, haze, and goniochromatic color of an object in the scene.
 9. The method of claim 1, further comprising applying shading as a post-process to reconstructed pixels in the constructed layered light fields.
 10. The method of claim 1, wherein encoding the plurality of layers comprises performing a sampling operation on at least a portion of the first data set to generate the second data set.
 11. The method of claim 10, wherein performing the sampling operation is based on a target compression rate associated with the second data set.
 12. The method of claim 10, wherein performing the sampling operation comprises selecting multiple elemental images from a plurality of elemental images in accordance with a plenoptic sampling scheme.
 13. The method of claim 10, wherein performing the sampling operation comprises: determining an effective spatial resolution associated with each layer; and selecting multiple elemental images from a plurality of elemental images in accordance with a determined angular resolution.
 14. The method of claim 13, wherein the angular resolution is determined as a function of a directional resolution associated with the portion of the scene associated with each layer.
 15. The method of claim 13, wherein the angular resolution is determined as a field of view associated with a display device.
 16. The method of claim 1, wherein encoding the plurality of layers comprises: rendering using ray tracing, a set of pixels to be encoded; selecting multiple elemental images from a plurality of elemental images such that the set of pixels are rendered using the selected multiple elemental images; and sampling the set of pixels using a sampling operation.
 17. The method of claim 1, wherein the three-dimensional description comprises light field data representing a plurality of elemental images.
 18. The method of claim 17, wherein the light field data includes a depth map corresponding to the elemental images.
 19. The method of claim 17, wherein each of the plurality of elemental images is captured by one or more image acquisition devices.
 20. The method of claim 1, further comprising: receiving user-input indicative of a location of a user with respect to the final light field; and updating the final light field in accordance with the user-input prior to displaying the three-dimensional scene.
 21. The method of claim 1, wherein the information on the directions of normals is stored in a geometry buffer.
 22. The method of claim 21, wherein the geometry buffer stores color and depth information.
 23. The method of claim 1, wherein partitioning the first data set into a plurality of layers comprises restricting a depth range of each layer.
 24. The method of claim 1, wherein layers in the plurality of layers located closer to the display surface are narrower in width than layers located farther away from the display surface.
 25. The method of claim 1, wherein partitioning the first data set into a plurality of layers maintains a uniform compression rate across the scene.
 26. The method of claim 1, wherein the method is used to generate a synthetic light field for multi-dimensional video streaming, multi-dimensional interactive gaming, or real-time interactive content.
 27. The method of claim 26, wherein the synthetic light field is generated only in a valid viewing zone.